Essential concepts in Data Engineering
Batch processing
Batch processing refers to processing a high volume of data as a single batch within a specific time span: data is collected over a period, similar records are grouped together, and the whole batch is processed at once. It is used when the data size is known and finite. Because results only become available after the entire batch has run, it takes longer to deliver output, and it usually requires dedicated staff to handle scheduling and failures. A batch processor may also make multiple passes over the data.
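As a rough illustration, here is a minimal batch job in Python (the file names and schema are hypothetical): it reads a full day of records in one pass, writes aggregated output, and would typically be triggered on a schedule.

```python
import csv
from collections import defaultdict

def run_daily_batch(input_path: str, output_path: str) -> None:
    """Process one day's worth of sales records in a single pass."""
    totals = defaultdict(float)
    with open(input_path, newline="") as src:
        for row in csv.DictReader(src):            # the batch is known and finite
            totals[row["product_id"]] += float(row["amount"])
    with open(output_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["product_id", "total_amount"])
        for product_id, total in sorted(totals.items()):
            writer.writerow([product_id, total])

# Typically run once per night by a scheduler:
# run_daily_batch("sales_2024_01_01.csv", "daily_totals_2024_01_01.csv")
```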
Challenges with Batch processing
Debugging these systems is difficult and typically requires dedicated professionals to diagnose and fix errors.
Software and training involve significant up-front expense, since teams must first understand batch scheduling, triggering, notifications, and related tooling.
Advantages of Batch Processing
Efficiency in Handling Large Volumes: Batch processing is very efficient for big volumes of data because it groups the data and processes it in one run.
Reduced Costs: Because work is processed in bulk, it can often be scheduled outside business hours on otherwise idle capacity, which frequently reduces expenses.
Simplified Error Handling: Errors are easier to correct because each batch can be audited as a unit and reprocessed if needed.
Disadvantages of Batch Processing
Delayed Results: Output only becomes available after the entire batch has been processed, so this approach suits only tasks that can tolerate a considerable delay.
Inflexibility: Once a batch job has started, it is difficult to change its logic or handle new inputs until the current batch finishes.
Stream processing
Stream processing refers to processing a continuous stream of data immediately, as it is produced, analyzing the data in real time. It is used when the data size is unknown, unbounded, and continuous. Each piece of data is processed within seconds or milliseconds, and the output rate keeps pace with the input rate. A stream processor handles data in a few passes. Stream processing is the right choice when the data stream is continuous and an immediate response is required.
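For contrast, here is a minimal stream-processing sketch in Python. The generator stands in for a real message broker (Kafka, Kinesis, etc.), and the field names and threshold are made up; the point is that each event is handled the moment it arrives.

```python
import random
import time
from typing import Iterator

def sensor_stream() -> Iterator[dict]:
    """Simulates an unbounded stream of sensor readings (stand-in for a real broker)."""
    while True:
        yield {"sensor_id": "s-1", "temperature": 20 + random.random() * 10, "ts": time.time()}
        time.sleep(0.1)

def process(stream: Iterator[dict]) -> None:
    """React to each event immediately instead of waiting for a batch to accumulate."""
    for event in stream:
        if event["temperature"] > 28:
            print(f"ALERT: {event['sensor_id']} reported {event['temperature']:.1f} °C")

# process(sensor_stream())  # runs indefinitely, because the input is unbounded
```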
Challenges with Stream processing
Keeping the data output rate in step with the input rate can be difficult.
The system must cope with huge volumes of data while still responding immediately.
Advantages of Stream Processing
Real-Time Processing: Stream processing makes real-time processing possible, producing results and triggering actions as data arrives.
Continuous Data Handling: Ideal for set-ups with a constant stream of data that needs to be analyzed as soon as it arrives.
Scalability: Stream processing systems can absorb variable and bursty data flows, making them effective for large-scale data systems.
Disadvantages of Stream Processing
Complexity: Stream processing systems are complex to implement and manage, and therefore require specialized skills.
Higher Costs: Real-time processing requires more compute power and can be more expensive than batch processing.
Change Data Capture (CDC)
Change Data Capture (CDC) is a method used in databases to track and record changes made to data. It captures modifications like inserts, updates, and deletes, and stores them for analysis or replication. CDC helps maintain data consistency across different systems by keeping track of alterations in real-time. It's like having a digital detective that monitors changes in a database and keeps a log of what happened and when.
Importance of Change Data Capture (CDC)
Change Data Capture (CDC) holds immense importance in facilitating real-time data synchronization and powering event-driven architectures.
Real-time Data Synchronization: CDC captures and propagates data changes as they occur, ensuring that all connected systems remain updated in real-time. This is crucial for scenarios where multiple systems or databases need to stay synchronized without delays, enabling seamless data sharing and consistency across the ecosystem.
Event-Driven Architectures: CDC serves as a cornerstone for event-driven architectures, where actions are triggered by events or changes in the system. By capturing data changes as events, CDC enables systems to react dynamically to these changes, initiating relevant processes or workflows in real time. This results in more responsive and agile systems that can adapt to changing conditions or requirements instantly.
Efficient Data Processing: CDC minimizes the need for manual intervention or batch processing by continuously streaming data changes. This leads to more efficient data processing pipelines, reducing latency and ensuring that downstream systems have access to the latest information without waiting for scheduled updates.
Scalability and Flexibility: With CDC, event-driven architectures can scale easily to handle increasing data volumes and accommodate evolving business needs. By decoupling components and leveraging asynchronous communication, CDC enables systems to scale horizontally while maintaining responsiveness and reliability.
Enhanced Analytics and Insights: Real-time data synchronization facilitated by CDC enables organizations to derive insights from up-to-date data, driving informed decision-making and enabling timely actions. By integrating CDC with analytics platforms, organizations can gain immediate visibility into trends, patterns, and anomalies, empowering them to respond swiftly to changing market conditions or customer behaviors.
Working Principles of CDC
Capture: CDC captures changes made to data in a source system, including inserts, updates, and deletes, without affecting the source's performance.
Log-based Tracking: It leverages database transaction logs or replication logs to identify and extract data changes, ensuring accurate and reliable capture (a minimal sketch of this appears after the list).
Incremental Updates: Instead of transferring entire datasets, CDC focuses on transmitting only the changed data, minimizing network bandwidth and processing overhead.
Real-time or Near Real-time: CDC operates in real-time or near real-time, ensuring that data changes are propagated to target systems promptly, maintaining data freshness.
Idempotent Processing: CDC processes changes in an idempotent manner, ensuring that duplicate changes do not result in unintended side effects or data inconsistencies.
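The sketch below is a simplified, hypothetical Python illustration of these principles: change events read from a transaction log are applied incrementally and idempotently to a replica, using a log sequence number to skip changes that were already seen. Real CDC tools (e.g., Debezium or database-native CDC) do the same against actual database logs.

```python
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class ChangeEvent:
    """One change read from a (simulated) database transaction log."""
    lsn: int                                   # log sequence number: position in the log
    op: Literal["insert", "update", "delete"]
    key: str
    value: Optional[dict] = None

def apply_changes(replica: dict, log: list[ChangeEvent], last_applied_lsn: int) -> int:
    """Apply only unseen changes to the replica; replaying the same log is safe (idempotent)."""
    for event in sorted(log, key=lambda e: e.lsn):
        if event.lsn <= last_applied_lsn:      # already applied -> skip (incremental updates)
            continue
        if event.op == "delete":
            replica.pop(event.key, None)
        else:                                  # insert or update
            replica[event.key] = event.value
        last_applied_lsn = event.lsn
    return last_applied_lsn

replica: dict = {}
log = [
    ChangeEvent(1, "insert", "user:1", {"name": "Ada"}),
    ChangeEvent(2, "update", "user:1", {"name": "Ada Lovelace"}),
    ChangeEvent(3, "delete", "user:1"),
]
lsn = apply_changes(replica, log, last_applied_lsn=0)
lsn = apply_changes(replica, log, last_applied_lsn=lsn)   # replay: no effect, no inconsistency
```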
Use Cases of Change Data Capture (CDC)
Data Warehousing: CDC is used to replicate data from transactional databases to data warehouses, ensuring that analytical systems have access to the latest operational data for reporting and analysis.
Replication: CDC facilitates database replication across geographically distributed environments, enabling disaster recovery, data distribution, and load balancing.
Data Integration: CDC enables seamless data integration between heterogeneous systems, supporting scenarios such as integrating legacy systems with modern applications or synchronizing data between cloud and on-premises environments.
Real-time Analytics: CDC powers real-time analytics platforms by continuously feeding data changes into analytical systems, enabling organizations to derive insights from fresh data and respond swiftly to changing conditions.
Data Synchronization: CDC ensures data consistency across multiple systems by synchronizing data changes in real-time, supporting scenarios such as synchronization between operational databases and caching layers or between microservices in distributed architectures.
Applications of Change Data Capture (CDC)
Financial Services: CDC is used in financial services for real-time fraud detection, risk management, and compliance monitoring by capturing and analyzing transactional data changes in real time.
E-commerce: In e-commerce, CDC enables real-time inventory management, order processing, and personalized marketing by synchronizing data changes across multiple systems, such as inventory databases, order management systems, and customer relationship management (CRM) platforms.
Idempotency
Idempotency is a property of operations or API requests that ensures repeating the operation multiple times produces the same result as executing it once.
Safe methods are idempotent but not all idempotent methods are safe.
HTTP methods like GET, HEAD, PUT, DELETE, OPTIONS, and TRACE are idempotent, while POST and PATCH are generally non-idempotent.
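A common way to make a non-idempotent operation (such as creating a payment) safe to retry is an idempotency key. The sketch below is a minimal, in-memory Python illustration; the function and field names are invented.

```python
processed: dict[str, dict] = {}   # idempotency key -> response of the first successful call

def create_payment(idempotency_key: str, amount: float) -> dict:
    """Repeating the request with the same key returns the original result instead of charging twice."""
    if idempotency_key in processed:
        return processed[idempotency_key]                # replay: no new side effect
    response = {"status": "charged", "amount": amount}   # the side effect happens exactly once
    processed[idempotency_key] = response
    return response

first = create_payment("order-42", 99.0)
retry = create_payment("order-42", 99.0)   # e.g. the client resent after a timeout
assert first == retry                      # same result, no double charge
```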
OLAP vs OLTP
OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing) are both integral parts of data management, but they have different functionalities.
OLTP focuses on handling large numbers of transactional operations in real time, ensuring data consistency and reliability for daily business operations.
OLAP is designed for complex queries and data analysis, enabling businesses to derive insights from vast datasets through multidimensional analysis.
Let's learn about the differences between them in detail:
Online Analytical Processing (OLAP)
Online Analytical Processing (OLAP) refers to software tools used to analyze data for business decision-making. OLAP systems generally allow users to extract and view data from multiple perspectives, often in a multidimensional format, which is essential for understanding complex interrelations in the data. These systems are part of data warehousing and business intelligence, enabling trend analysis, financial forecasting, and other forms of in-depth data analysis.
OLAP Examples
Any data warehouse system is an OLAP system. Typical uses of OLAP systems include:
Spotify personalizes homepages with custom songs and playlists based on user preferences.
Netflix movie recommendation system.
Online Transaction Processing (OLTP)
Online Transaction Processing, commonly known as OLTP, is a data processing approach that emphasizes the real-time execution of transactions. Most OLTP systems are designed to handle large numbers of short, atomic operations that keep databases consistent. To maintain transaction integrity and reliability, these systems support the ACID (Atomicity, Consistency, Isolation, Durability) properties. This is what allows critical applications such as online banking and reservation systems to run reliably.
OLTP Examples
A classic example of an OLTP system is an ATM: the person who authenticates first is served first, and a withdrawal only succeeds if the requested amount is actually available in the ATM.
An ATM center is an OLTP application.
OLTP handles the ACID properties during data transactions via the application.
OLTP is also used for online banking, online airline ticket booking, sending a text message, and adding a book to a shopping cart (a minimal sketch of an ACID withdrawal transaction follows this list).
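Here is that sketch: the ATM-style withdrawal as an atomic OLTP transaction, written with Python's built-in sqlite3 module. The table and account names are illustrative; the point is that the balance check and the debit either commit together or roll back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 500.0)")
conn.commit()

def withdraw(conn: sqlite3.Connection, account_id: str, amount: float) -> bool:
    """A short atomic transaction: the balance check and the debit succeed or fail together."""
    try:
        with conn:   # opens a transaction; commits on success, rolls back on any exception
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE id = ?", (account_id,)
            ).fetchone()
            if balance < amount:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, account_id)
            )
        return True
    except ValueError:
        return False

print(withdraw(conn, "alice", 200.0))   # True  -> balance is now 300.0
print(withdraw(conn, "alice", 900.0))   # False -> rolled back, balance unchanged
```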
Benefits of OLTP Services
Allow users to quickly read, write, and delete data.
Support an increase in users and transactions for real-time data access.
Provide better data protection through multiple security features.
Aid in decision-making with accurate, up-to-date data.
Ensure data integrity, consistency, and high availability.
Drawbacks of OLTP Services
Limited analysis capability, not suited for complex analysis or reporting.
High maintenance costs due to frequent updates, backups, and recovery.
Columnar vs Row-based Storage
Column oriented
In a column-oriented data store, data is organized and stored by columns rather than by rows. This approach is optimized for retrieving specific columns of data and is typically used in data warehousing and analytics systems.
Advantages of Column-Oriented Databases
Designed with Online Analytical Processing (OLAP) in mind: ideal for analytical queries that perform operations and aggregates on specific columns.
Storage Efficiency: storing each column contiguously enables greater compression, which also improves query performance.
Faster Query Performance: queries that only need a subset of columns run more quickly because just the relevant data is read.
Disadvantages of Column-Oriented Databases
Complicated Row Retrieval: When data is dispersed across many columns, retrieving full rows may be more difficult.
Less Appropriate for OLTP: Transactional workloads involving frequent insert, update, or delete operations are less efficient using this approach.
Query Complexity: To optimize performance, certain query languages and optimization strategies are needed.
Row-based
In a row-oriented data store, data is stored and retrieved row-by-row, meaning that all of the attributes of a particular row are stored together in the same physical block of data. This approach is optimized for retrieving entire rows of data at a time and is typically used in traditional RDBMS systems.
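To make the difference concrete, here is a small Python sketch contrasting the two layouts (the table and values are made up): the columnar layout reads only the column an aggregate needs, while the row layout is the natural fit for fetching a whole record.

```python
# Row-oriented layout: all attributes of a record are stored together.
rows = [
    {"id": 1, "name": "Ada",    "amount": 120.0},
    {"id": 2, "name": "Grace",  "amount": 75.5},
    {"id": 3, "name": "Edsger", "amount": 42.0},
]

# Column-oriented layout: each column is stored contiguously.
columns = {
    "id":     [1, 2, 3],
    "name":   ["Ada", "Grace", "Edsger"],
    "amount": [120.0, 75.5, 42.0],
}

# Analytical query (sum of one column): the columnar layout touches only "amount",
# while the row layout has to walk every full record.
total_from_rows = sum(row["amount"] for row in rows)
total_from_columns = sum(columns["amount"])
assert total_from_rows == total_from_columns

# Full-row lookup: trivial in the row layout, reassembled piece by piece in the columnar one.
row_2 = rows[1]
row_2_from_columns = {col: values[1] for col, values in columns.items()}
assert row_2 == row_2_from_columns
```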
Advantages of Row-Oriented Databases
Effective in Online Transaction Processing (OLTP): Ideal for programs that execute insert, update, and delete commands often.
Easy to Use: Those who are acquainted with conventional relational databases will find it simple to comprehend and use.
Full Row Retrieval: Effective for queries requiring full row access, such as retrieving all the details of a single record.
Disadvantages of Row-Oriented Databases
Not Effective for Analytics: Query speed is reduced for analytical queries that only need certain columns, because entire rows must be read.
Storage Inefficiency: Because data is stored row by row, it compresses less effectively and may need extra storage space.
Limitations of Scaling: As data size grows, scaling might become more difficult.
Data Partitioning
Data partitioning is a technique for dividing large datasets into smaller, manageable chunks called partitions. Each partition contains a subset of data and is distributed across multiple nodes or servers. These partitions can be stored, queried, and managed as individual tables, though they logically belong to the same dataset.
Data partitioning improves database performance and scalability. For instance, searching for a data point in the entire table takes longer and uses more resources than searching for it in a specific partition. That's why data is stored as partitions.
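A minimal sketch of hash partitioning in Python (the four partitions and the user_id key are arbitrary choices): a record's key deterministically selects its partition, so a point lookup only has to scan one partition instead of the whole dataset.

```python
import hashlib

NUM_PARTITIONS = 4

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a record key to one of the partitions."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

partitions: list[list[dict]] = [[] for _ in range(NUM_PARTITIONS)]
for record in [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u3"}]:
    partitions[partition_for(record["user_id"])].append(record)

# A lookup for one user only needs to scan a single partition:
target = partition_for("u2")
matches = [r for r in partitions[target] if r["user_id"] == "u2"]
```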
ETL vs ELT
ELT Process
Extract, Load and Transform (ELT) is the technique of extracting raw data from the source, loading it into the target data warehouse, and then transforming it there for downstream users.
ELT consists of three different operations performed on the data:
Extract: Extracting data is the process of identifying data from one or more sources. The sources may include databases, files, ERP, CRM, or any other useful source of data.
Load: Loading is the process of storing the extracted raw data in a data warehouse or data lake.
Transform: Data transformation is the process in which the raw data from the source is transformed into the target format required for analysis.
ETL Process
ETL is the traditional technique of extracting raw data, transforming it as required by the users, and storing it in a data warehouse. ELT was developed later, with ETL as its base. The three operations in ETL and ELT are the same; only their order of processing differs, a change in sequence made to overcome some drawbacks of ETL. A minimal ETL sketch follows the steps below.
Extract: It is the process of extracting raw data from all available data sources such as databases, files, ERP, CRM or any other.
Transform: The extracted data is immediately transformed as required by the user.
Load: The transformed data is then loaded into the data warehouse from where the users can access it.
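Below is the minimal ETL sketch referred to above, with a hard-coded source and an in-memory SQLite database standing in for a real warehouse; the table and field names are invented.

```python
import sqlite3

def extract() -> list[dict]:
    """Pull raw records from a source (hard-coded here in place of a real database or API)."""
    return [
        {"order_id": "1", "amount": "19.99", "country": "us"},
        {"order_id": "2", "amount": "5.00",  "country": "de"},
    ]

def transform(raw: list[dict]) -> list[tuple]:
    """Clean and reshape the data before it reaches the warehouse (the 'T' before the 'L')."""
    return [(int(r["order_id"]), float(r["amount"]), r["country"].upper()) for r in raw]

def load(rows: list[tuple], warehouse: sqlite3.Connection) -> None:
    """Write the transformed rows into the target table."""
    warehouse.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    warehouse.commit()

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, country TEXT)")
load(transform(extract()), warehouse)
# In ELT the raw records would be loaded first and then transformed inside the warehouse.
```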
CAP Theorem
The CAP theorem highlights the limits system designers face when replicating data across a distributed system. It states that a distributed system can guarantee at most two of the three properties of consistency, availability, and partition tolerance at the same time; since network partitions cannot be ruled out in practice, the real trade-off is between consistency and availability while a partition is occurring.
Developers must carefully balance these attributes according to their particular application demands because of this underlying restriction. Designers may decide which qualities to prioritize to obtain the best performance and reliability for their systems by knowing the CAP theorem.
Windowing in Streaming
Windowing is a way of grouping events from a continuous stream into finite collections, or windows. Windows can be defined by time intervals, record counts, or other criteria. The aim of windowing is to let stream processing applications break continuous data streams into manageable chunks for processing and analysis.
Windowing is often used in stream processing applications to implement operations like aggregations, filtering, and transformation. By breaking down the stream into manageable chunks, it becomes easier to carry out operations that require knowledge of the state of the stream over time.
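As a simple illustration, the Python sketch below groups events into fixed, non-overlapping (tumbling) 60-second windows and counts the events in each; the window size and timestamps are arbitrary. Real stream processors also offer sliding and session windows and handle late-arriving events.

```python
from collections import defaultdict

WINDOW_SECONDS = 60

def tumbling_window_counts(events: list[dict]) -> dict[int, int]:
    """Assign each event to a fixed 60-second window by its timestamp and count per window."""
    counts: dict[int, int] = defaultdict(int)
    for event in events:
        window_start = int(event["ts"] // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[window_start] += 1
    return dict(counts)

events = [{"ts": 0.5}, {"ts": 42.0}, {"ts": 61.2}, {"ts": 130.9}]
print(tumbling_window_counts(events))   # {0: 2, 60: 1, 120: 1}
```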
DAGs and Workflow Orchestration
A DAG (directed acyclic graph) is the core Airflow concept: it collects tasks and organizes them with dependencies and relationships that specify how they should run.
DAG properties:
DAGs are defined in Python files placed in the DAGs folder.
Transitive closure and transitive reduction are well defined for DAGs; in particular, a DAG's transitive reduction is unique.
Each task (node) has an id that labels it and is used to reference its results.
dag_id is the unique value that identifies each DAG.
start_date defines when the DAG's schedule should start.
default_args is a dictionary of default parameters used as constructor keyword arguments when the DAG's operators are initialized. A minimal example DAG is sketched after this list.
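Here is that minimal example DAG, assuming a recent Airflow 2.x installation (2.4 or later for the schedule parameter); the dag_id, task names, and callables are illustrative placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data")

def transform():
    print("cleaning and reshaping")

def load():
    print("writing to the warehouse")

default_args = {"owner": "data-team", "retries": 1}

with DAG(
    dag_id="daily_sales_pipeline",      # unique identifier for this DAG
    start_date=datetime(2024, 1, 1),    # when scheduling begins
    schedule="@daily",
    catchup=False,
    default_args=default_args,          # applied to every operator below
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task   # dependencies form the acyclic graph
```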
Workflow orchestration:
Workflow orchestration refers to the process of coordinating and automating complex workflows consisting of multiple interconnected tasks.
Workflow orchestration involves:
designing and defining the workflows
scheduling & executing the tasks
monitoring progress & outcomes
handling any errors or exceptions that may arise
Examples of workflow orchestration tools:
Apache Airflow: a platform for developing, scheduling and monitoring batch-oriented workflows defined as DAGs in Python.
Dagster: focuses on data-aware pipelines, with a strong emphasis on data engineering practices.
Temporal: designed for fault-tolerant execution of long-running business processes.
Argo Workflows: a high-performance, container-native engine for orchestrating parallel jobs on Kubernetes.
Retry Logic & Dead Letter Queues
Dead Letter Queues
A dead-letter queue (DLQ) is a special type of message queue that temporarily stores messages that a software system cannot process due to errors. Message queues are software components that support asynchronous communication in a distributed system. They let you send messages between software services at any volume and don’t require the message receiver to always be available. A dead-letter queue specifically stores erroneous messages that have no destination or which can’t be processed by the intended receiver.
Retry Logic
The Retry Pattern in microservices system design is a strategy used to handle temporary failures that occur during service communication. In a distributed system where microservices interact with each other, transient issues such as network interruptions or brief service outages can lead to failed requests. To address these transient failures, the Retry Pattern involves automatically retrying a failed request a predetermined number of times before considering it a permanent failure.
This approach helps ensure that brief disruptions do not result in service failures or degraded performance.
The pattern typically includes configurable parameters, such as the number of retries and the delay between attempts, often incorporating exponential backoff and jitter to prevent overwhelming the system with retries.
By implementing the Retry Pattern, microservices can maintain higher reliability and resilience, improving the overall robustness of the system.
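The sketch below combines the two ideas in Python: a call is retried with exponential backoff and jitter, and a message that still fails after the last attempt is parked in a dead-letter queue (here just a list standing in for a real DLQ such as an SQS queue or a Kafka topic). The constants and the flaky_send function are invented for illustration.

```python
import random
import time
from typing import Callable

MAX_RETRIES = 3
BASE_DELAY_SECONDS = 0.5

dead_letter_queue: list[dict] = []   # stand-in for a real dead-letter queue

def call_with_retry(send_request: Callable[[dict], None], message: dict) -> bool:
    """Retry transient failures with backoff + jitter; dead-letter the message if all attempts fail."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            send_request(message)
            return True
        except ConnectionError:
            if attempt == MAX_RETRIES:
                dead_letter_queue.append(message)            # keep it for inspection or replay
                return False
            delay = BASE_DELAY_SECONDS * 2 ** (attempt - 1)  # exponential backoff
            time.sleep(delay + random.uniform(0, delay))     # jitter avoids synchronized retries
    return False

def flaky_send(message: dict) -> None:
    """Simulates a downstream service that fails intermittently."""
    if random.random() < 0.5:
        raise ConnectionError("transient network failure")

call_with_retry(flaky_send, {"order_id": 42})
```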
Benefits of the Retry Pattern in Microservices
The Retry Pattern offers several key benefits in microservices architecture:
Increased Reliability: By automatically retrying failed requests, the Retry Pattern helps ensure that transient issues do not result in service failures or degraded performance. This enhances the overall reliability of the system.
Improved Fault Tolerance: The pattern allows services to handle temporary failures gracefully, reducing the impact of short-term problems like network glitches or brief outages.
Enhanced User Experience: Users experience fewer disruptions as temporary failures are managed automatically. This leads to smoother interactions and a more robust service delivery.
Reduced Manual Intervention: Automated retries reduce the need for manual error handling and intervention, streamlining operations and improving efficiency.
Backfilling & Reprocessing
Backfilling
Backfilling refers to the process of retroactively filling in missing data or correcting incorrect data in a pipeline or dataset. In simpler terms, if some historical data wasn't processed correctly (or at all) the first time, backfilling allows you to reprocess that past data. This ensures your data warehouse or reports reflect a consistent and complete history.
There are many situations that lead teams to perform a backfill. Common scenarios include data gaps caused by system failures or pipeline bugs, schema or logic changes that require recalculating past results, onboarding of new data sources where historical data needs to be integrated, and regulatory or analytics needs that demand complete historical records. In all these cases, backfilling is essential to maintain data integrity.
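In practice a backfill is often just a controlled re-run of the regular pipeline over a historical date range, one partition at a time. The sketch below is a minimal illustration; process_partition and the date range are placeholders for real pipeline logic, and orchestrators such as Airflow provide built-in backfill commands for the same purpose.

```python
from datetime import date, timedelta

def process_partition(day: date) -> None:
    """Stand-in for the real pipeline logic that rebuilds one daily partition."""
    print(f"rebuilding partition for {day.isoformat()}")

def backfill(start: date, end: date) -> None:
    """Re-run the pipeline for every day in a historical range, one partition at a time."""
    current = start
    while current <= end:
        process_partition(current)
        current += timedelta(days=1)

# Example: a bug shipped on Jan 3 corrupted a week of output, so rebuild that window.
backfill(date(2024, 1, 3), date(2024, 1, 9))
```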
Importance of backfilling:
Data consistency: Filling in gaps or correcting past errors keeps your datasets consistent over time. Without backfills, you might have days or partitions of data that are out of sync or missing critical fields, leading to inconsistencies.
Error correction: If a bug in a pipeline caused incorrect calculations or dropped data in the past, a backfill can re-run the logic on historical data to fix those mistakes. This ensures that historical records are accurate and trustworthy.
Data Reprocessing
Data reprocessing refers to the process of re-running a data processing pipeline on existing data, typically to correct errors, update information, or incorporate new logic. This can involve recalculating results, refreshing data based on new information, or applying updated rules or algorithms.
Time Travel & Data Versioning
Data Versioning
Data versioning is a critical aspect of data management in cloud computing. It allows for the tracking and control of changes made to data objects, facilitating data recovery and ensuring data integrity. Each version of a data object represents a snapshot of that object at a specific point in time, providing a historical record of the object's state.
Versioning is particularly useful in scenarios where multiple users or applications are modifying the same data object. It allows for the resolution of conflicts and the prevention of data loss due to overwrites. Furthermore, versioning enables the rollback of changes, providing a safety net in case of errors or unwanted modifications.
Time Travel
Time travel in cloud computing refers to the ability to view and manipulate data as it existed at any point in the past. This is achieved by maintaining a historical record of all changes made to the data. Time travel allows for the recovery of lost data, the auditing of changes, and the analysis of data trends over time.
Some cloud-based data platforms provide time travel as a built-in feature, allowing users to query past states of the data without the need for manual version management. This can be particularly useful in scenarios involving data analysis and auditing, where understanding the historical state of the data is crucial.
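As a toy illustration of both ideas, the Python sketch below keeps every write as a timestamped version and answers "as of" queries against past states; real platforms (for example, Snowflake time travel or Delta Lake) provide this natively and at scale. All names here are invented.

```python
import time
from bisect import bisect_right

class VersionedStore:
    """Keeps every write as a timestamped version so earlier states can be read back."""

    def __init__(self) -> None:
        self._timestamps: dict[str, list[float]] = {}
        self._values: dict[str, list[object]] = {}

    def put(self, key: str, value: object) -> None:
        """Record a new version instead of overwriting the old one."""
        self._timestamps.setdefault(key, []).append(time.time())
        self._values.setdefault(key, []).append(value)

    def get_as_of(self, key: str, timestamp: float):
        """Time travel: return the latest value written at or before `timestamp`."""
        stamps = self._timestamps.get(key, [])
        idx = bisect_right(stamps, timestamp)
        return self._values[key][idx - 1] if idx > 0 else None

store = VersionedStore()
store.put("config", {"feature_flag": False})
checkpoint = time.time()
time.sleep(0.01)
store.put("config", {"feature_flag": True})

print(store.get_as_of("config", checkpoint))    # {'feature_flag': False} -- the earlier version
print(store.get_as_of("config", time.time()))   # {'feature_flag': True}  -- the current version
```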
Data Governance
Data governance is the data management discipline that focuses on the quality, security and availability of an organization’s data. Data governance helps ensure data integrity and data security by defining and implementing policies, standards and procedures for data collection, ownership, storage, processing and use.
Distributed computing
Distributed computing refers to a system where processing and data storage are distributed across multiple devices or systems, rather than being handled by a single central device. In a distributed system, each device or system has its own processing capabilities and may also store and manage its own data. These devices or systems work together to perform tasks and share resources, with no single device serving as the central hub.
One example of a distributed computing system is a cloud computing system, where resources such as computing power, storage, and networking are delivered over the Internet and accessed on demand. In this type of system, users can access and use shared resources through a web browser or other client software.
Components
There are several key components of a Distributed Computing System
Devices or Systems: The devices or systems in a distributed system have their own processing capabilities and may also store and manage their own data.
Network: The network connects the devices or systems in the distributed system, allowing them to communicate and exchange data.
Resource Management: Distributed systems often have some type of resource management system in place to allocate and manage shared resources such as computing power, storage, and networking.
The architecture of a Distributed Computing System is typically a Peer-to-Peer Architecture, where devices or systems can act as both clients and servers and communicate directly with each other.
Characteristics
There are several characteristics that define a Distributed Computing System
Multiple Devices or Systems: Processing and data storage are distributed across multiple devices or systems.
Peer-to-Peer Architecture: Devices or systems in a distributed system can act as both clients and servers, as they can both request and provide services to other devices or systems in the network.
Shared Resources: Resources such as computing power, storage, and networking are shared among the devices or systems in the network.
Horizontal Scaling: Scaling a distributed computing system typically involves adding more devices or systems to the network to increase processing and storage capacity. This can be done through hardware upgrades or by adding additional devices or systems to the network.
Advantages
Some Advantages of the Distributed Computing System are:
Scalability: Distributed systems are generally more scalable than centralized systems, as they can easily add new devices or systems to the network to increase processing and storage capacity.
Reliability: Distributed systems are often more reliable than centralized systems, as they can continue to operate even if one device or system fails.
Flexibility: Distributed systems are generally more flexible than centralized systems, as they can be configured and reconfigured more easily to meet changing computing needs.
There are a few limitations to Distributed Computing Systems:
Complexity: Distributed systems can be more complex than centralized systems, as they involve multiple devices or systems that need to be coordinated and managed.
Security: It can be more challenging to secure a distributed system, as security measures must be implemented on each device or system to ensure the security of the entire system.
Performance: Distributed systems may not offer the same level of performance as centralized systems, as processing and data storage are distributed across multiple devices or systems.
Applications
Distributed Computing Systems have a number of applications, including:
Cloud Computing: Cloud Computing systems are a type of distributed computing system that are used to deliver resources such as computing power, storage, and networking over the Internet.
Peer-to-Peer Networks: Peer-to-Peer Networks are a type of distributed computing system that is used to share resources such as files and computing power among users.
Distributed Architectures: Many modern computing systems, such as microservices architectures, use distributed architectures to distribute processing and data storage across multiple devices or systems.