Imagine you’ve recently been recruited by a fast-paced startup. You start interacting with systems that track millions of users, process orders in real time, and generate instant insights. You come across a whirlwind of jargon: batch and stream processing, OLAP vs OLTP, data partitioning, workflow orchestration. Little by little, you grasp each concept and see how it keeps the business informed, agile, and ahead of the competition.
This article will walk you through these core data engineering concepts, turning abstract terms into practical tools you can apply in the real world.
Batch vs Stream processing
Batch processing is a method that collects and processes large volumes of data in predefined, fixed-size chunks, or batches.
It is useful when you need to work on an entire dataset at once instead of handling each record as it arrives.
Typical applications include calculating salaries and employee benefits at the end of the month, processing daily point-of-sale data to identify top-selling products, and analyzing production data at the end of each shift.
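As a rough sketch, a daily batch job could look like the following in Python, assuming pandas is installed and a hypothetical sales.csv file of point-of-sale records with product and amount columns:

```python
import pandas as pd

# Hypothetical daily batch job: process the whole day's point-of-sale file at once.
def run_daily_batch(path="sales.csv"):
    sales = pd.read_csv(path)                    # load the full batch
    top_products = (
        sales.groupby("product")["amount"]
        .sum()
        .sort_values(ascending=False)
        .head(10)                                # top-selling products for the day
    )
    top_products.to_csv("top_products.csv")      # write the batch result

if __name__ == "__main__":
    run_daily_batch()
```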
Stream processing, on the other hand, is designed to handle and analyze data in real time as it flows through a system, capturing it continuously and incrementally.
Common real-time use cases include live analytics for financial markets, network traffic monitoring, and triggering alerts for fraud detection.
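Here is a minimal sketch of this per-event style of processing, with a small Python generator standing in for a real source such as a Kafka consumer (the threshold and event shape are made up for illustration):

```python
import time

# Stand-in for a real event source (e.g. a Kafka consumer): events arrive one by one.
def event_stream():
    for i in range(5):
        yield {"account": "A-1", "amount": 100 * (i + 1)}
        time.sleep(0.1)  # simulate events arriving over time

ALERT_THRESHOLD = 400

# Stream processing: react to each event as it arrives, instead of waiting for a batch.
for event in event_stream():
    if event["amount"] > ALERT_THRESHOLD:
        print(f"ALERT: unusually large transaction {event}")
```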
Change Data Capture (CDC)
Change Data Capture (CDC) is a data integration approach that detects and records modifications made to a source database, then transfers those changes in real time to a target system such as a data warehouse.
For example, an e-commerce company updates its customer database whenever an order is placed or modified. CDC captures these changes immediately and sends them to the firm’s data warehouse, allowing analysts to track sales in near real time.
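The sketch below shows the core idea with hard-coded change events applied to an in-memory “warehouse” table; in practice the events would come from a CDC tool reading the source database’s transaction log:

```python
# Hypothetical change events captured from the source database.
change_events = [
    {"op": "insert", "id": 1, "data": {"customer": "Ada", "total": 30}},
    {"op": "update", "id": 1, "data": {"customer": "Ada", "total": 45}},
    {"op": "delete", "id": 2, "data": None},
]

warehouse_orders = {2: {"customer": "Bo", "total": 12}}   # current target state

for event in change_events:
    if event["op"] in ("insert", "update"):
        warehouse_orders[event["id"]] = event["data"]     # upsert the latest row image
    elif event["op"] == "delete":
        warehouse_orders.pop(event["id"], None)           # remove the deleted row

print(warehouse_orders)  # the target now mirrors the source's latest state
```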
Idempotency
Idempotency is the property that guarantees an operation produces the same outcome no matter how many times it is executed. It is useful in data pipelines because upstream data may be delivered multiple times, often due to retries or errors.
Designing idempotent processes keeps data consistent and avoids duplicates or errors.
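A minimal sketch of an idempotent load, assuming each record carries a unique order_id that the target is keyed on:

```python
# Writing the same record twice leaves the target in the same state,
# because rows are keyed by a unique order_id (upsert, not append).
target = {}

def upsert_order(order):
    target[order["order_id"]] = order   # overwrite instead of appending a duplicate

record = {"order_id": "o-42", "total": 99.0}
upsert_order(record)
upsert_order(record)     # a retry re-delivers the same record

assert len(target) == 1  # still exactly one row for this order
```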
OLTP vs OLAP
OLTP (Online Transaction Processing) manages high volumes of short, real-time transactions such as online bookings, and prioritizes speed, accuracy, and data integrity through ACID properties.
OLAP (Online Analytical Processing) focuses on complex, multidimensional queries for decision-making, enabling trend analysis, forecasting, and deep insights from large datasets.
OLTP supports daily operations effectively, while OLAP empowers strategic planning by analyzing historical data, though it is costlier and updated less frequently.
OLTP is operational while OLAP is analytical.
Columnar vs Row-based Storage
Row-based storage reads and writes data row by row, making it well suited to transactional workloads that touch whole records.
It’s great for transactional processing (OLTP): reading or writing an entire row is fast, and inserting or updating records is simple.
However, it’s less efficient for queries that need only a few columns from many rows, and it typically requires more storage space.
Columnar storage organizes data by column or field, making it easy to aggregate values and perform calculations.
It’s ideal for analytical processing (OLAP): it compresses efficiently and supports faster reads when querying specific columns across many rows. Its limitations are that it’s slower for row-level updates or inserts and more complex to implement.
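A toy illustration of the two layouts for the same three records, in plain Python (the field names are made up):

```python
# Row-based layout: each record is stored together, handy for OLTP-style lookups.
rows = [
    {"id": 1, "name": "Ada", "amount": 30},
    {"id": 2, "name": "Bo",  "amount": 12},
    {"id": 3, "name": "Cy",  "amount": 45},
]
full_record = rows[1]                    # read one whole row cheaply

# Columnar layout: each column is stored together, handy for OLAP-style aggregates.
columns = {
    "id":     [1, 2, 3],
    "name":   ["Ada", "Bo", "Cy"],
    "amount": [30, 12, 45],
}
total_amount = sum(columns["amount"])    # scan just one column, skip the rest
```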
Data Partitioning
Data partitioning refers to the process of splitting a large dataset into smaller, more manageable pieces called partitions. Each partition holds a portion of the data and can be stored or processed separately, usually across multiple servers or nodes.
Partitioning boosts performance and scalability because queries can target only the relevant partition rather than scanning the entire dataset, which makes data retrieval faster and more efficient.
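A small sketch of key-based partitioning, using the order date as a hypothetical partition key:

```python
from collections import defaultdict

records = [
    {"order_id": 1, "date": "2024-05-01", "total": 30},
    {"order_id": 2, "date": "2024-05-01", "total": 12},
    {"order_id": 3, "date": "2024-05-02", "total": 45},
]

# Route each record to a partition keyed by its order date.
partitions = defaultdict(list)
for record in records:
    partitions[record["date"]].append(record)

# A query for a single day scans one small partition, not the whole dataset.
may_second = partitions["2024-05-02"]
print(may_second)
```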
ETL/ELT
ETL (Extract, Transform, Load) transforms data before loading it into the target system. It is ideal for smaller, structured datasets, but it is slower and less flexible, and it typically needs dedicated infrastructure and custom security.
ELT (Extract, Load, Transform) loads raw data into a data warehouse first and then transforms it as required. It is faster, suits large and diverse datasets and cloud environments, supports data lakes, and scales cost-efficiently.
In short, ELT is most suitable for flexibility and volume, while ETL fits smaller datasets that need transformation before loading.
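The difference is really just the order of the steps. In the sketch below, the extract, clean, and “warehouse” pieces are trivial in-memory stand-ins for real connectors and storage:

```python
def extract():
    return [{"name": " Ada ", "total": "30"}]   # raw, messy source rows

def clean(rows):
    return [{"name": r["name"].strip(), "total": int(r["total"])} for r in rows]

warehouse = {"raw": [], "curated": []}

def run_etl():
    warehouse["curated"].extend(clean(extract()))   # transform BEFORE loading

def run_elt():
    warehouse["raw"].extend(extract())              # load raw data first...
    warehouse["curated"] = clean(warehouse["raw"])  # ...then transform inside the warehouse

run_elt()
print(warehouse)
```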
CAP Theorem
The CAP Theorem is a fundamental concept in distributed systems theory introduced by Eric Brewer.
The CAP Theorem states that a distributed data system can guarantee only two of the following three properties simultaneously:
Consistency (C): Every node sees the same, most recent data after a write.
Availability (A): Every request receives a response, even during failures.
Partition Tolerance (P): The system continues operating despite network splits between nodes.
Windowing in Streaming
Windowing is a technique in stream processing that breaks continuous, infinite data streams into smaller, manageable chunks called windows.
Rather than processing the entire stream at once (which is impractical because the stream is unbounded), windowing allows computations over data collected during specific time frames.
Real-life application: Imagine a ride-sharing app like Bolt that tracks driver locations continuously. Using windowing, the system can process location updates every 5 minutes (a time window) to calculate metrics such as average speed or driver availability in that time frame. This allows the app to offer timely and relevant information without waiting for the entire data stream to end.
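A minimal sketch of a tumbling (fixed-size, non-overlapping) time window in plain Python; the events, timestamps, and 5-minute window size are made up for illustration:

```python
from collections import defaultdict

# Location updates with timestamps in seconds since the stream started.
events = [
    {"driver": "d1", "ts": 10,  "speed": 42},
    {"driver": "d1", "ts": 180, "speed": 38},
    {"driver": "d1", "ts": 320, "speed": 55},   # falls into the next window
]

WINDOW_SECONDS = 300  # 5-minute tumbling windows

# Assign each event to the window it falls into, then aggregate per window.
windows = defaultdict(list)
for e in events:
    window_start = (e["ts"] // WINDOW_SECONDS) * WINDOW_SECONDS
    windows[window_start].append(e["speed"])

for start, speeds in sorted(windows.items()):
    print(f"window starting at {start}s: avg speed {sum(speeds) / len(speeds):.1f}")
```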
DAGs and Workflow Orchestration
A DAG (Directed Acyclic Graph) represents a sequence of operations or tasks that need to be executed in a specific order, without any loops or cycles in the dependencies.
Workflow orchestration refers to the automated management and scheduling of these tasks in a DAG, handling execution, retries, and dependencies.
Tools such as Apache Airflow use DAGs to define complex data pipelines, ensuring each step runs smoothly and in the correct sequence.
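A minimal sketch of what such a DAG can look like, assuming a recent Airflow 2.x installation (older releases use schedule_interval instead of schedule); the task names and logic are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("extracting...")

def transform():
    print("transforming...")

with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # The dependency edge: transform runs only after extract succeeds.
    extract_task >> transform_task
```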
Retry Logic and Dead Letter Queues
Retry logic is a method used to handle failures when processing messages or tasks that depend on unreliable external services. Instead of failing immediately, the system retries the operation several times, typically with exponential backoff, waiting longer after each attempt.
Sometimes retries fail repeatedly due to issues such as corrupt messages or prolonged outages. In that case, messages are moved to a Dead Letter Queue (DLQ), a special queue that stores these “poison” or unprocessable messages separately.
DLQs allow developers to isolate problematic messages for later analysis without disrupting the main workflow. Together, retry logic and DLQs make failure handling reliable and robust.
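A compact sketch of both ideas in Python, with a deliberately flaky function standing in for the unreliable external service:

```python
import random
import time

dead_letter_queue = []

def flaky_call(message):
    # Stand-in for an unreliable external service that fails most of the time.
    if random.random() < 0.7:
        raise ConnectionError("service unavailable")
    return f"processed {message}"

def process_with_retries(message, max_attempts=4):
    for attempt in range(max_attempts):
        try:
            return flaky_call(message)
        except ConnectionError:
            time.sleep(2 ** attempt * 0.1)   # exponential backoff: 0.1s, 0.2s, 0.4s...
    dead_letter_queue.append(message)        # give up and park it for later analysis
    return None

process_with_retries({"order_id": "o-42"})
print("DLQ contents:", dead_letter_queue)
```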
Backfilling & Reprocessing
Backfilling is the process of filling in missing historical data in a data pipeline or warehouse. It runs jobs over past data to ensure completeness and consistency when new pipelines or transformations are deployed, or when data gaps are discovered.
Reprocessing means rerunning data processing tasks over existing datasets, often to correct errors, apply updated logic, or incorporate fixes. Reprocessing can cover all or part of the data, whether or not anything was missing.
Backfilling and reprocessing are vital for upholding accurate and reliable datasets and adapting to changes or errors in data workflows.
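A simple sketch of a backfill loop that re-runs a daily job for each date in a historical range; run_daily_job is a hypothetical stand-in for the real pipeline task:

```python
from datetime import date, timedelta

def run_daily_job(day):
    print(f"processing data for {day}")   # the real task would read/write that day's data

def backfill(start, end):
    current = start
    while current <= end:
        run_daily_job(current)            # same job, pointed at historical dates
        current += timedelta(days=1)

backfill(date(2024, 5, 1), date(2024, 5, 7))
```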
Data governance
Data governance is everything you do to ensure data is secure, private, accurate, available, and usable. It includes the actions people must take, the processes they must follow, and the technology that supports them throughout the data life cycle.
Time Travel & Data Versioning
Time travel lets you query or restore the exact state of your data as it existed at a previous point in time.
Time travel helps with auditing, debugging, recovery, and reproducibility in analytics or ML. It is accomplished by storing historical versions of data along with metadata, often in systems like Snowflake, Delta Lake, and BigQuery.
Data versioning refers to keeping multiple historical versions of a dataset, each assigned a unique identifier.
Data versioning is vital because it lets you track which dataset version was used for model training or audits, roll back to a previous version, and ensure reproducible outcomes in analytics or ML workflows.
Example Use Case
Suppose your team trained an ML model last quarter and now needs to recreate the exact dataset used back then. Time travel lets you query or clone the dataset as it existed at that particular moment, ensuring consistent and reproducible training results.
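As a hedged sketch of how this can look with Delta Lake, assuming a Spark session configured with the delta-spark package; the table path and timestamp are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("time-travel-example").getOrCreate()

# Read the table exactly as it looked at the end of last quarter...
snapshot = (
    spark.read.format("delta")
    .option("timestampAsOf", "2024-03-31")
    .load("/data/training_features")
)

# ...or pin a specific table version instead of a timestamp.
version_zero = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/data/training_features")
)
```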
Distributed Processing Concepts
Distributed processing concepts refer to how large-scale data tasks are split across multiple machines to handle huge volumes of data in an efficient and reliable manner.
The core ideas include workload distribution (partitioning data to reduce execution time), scalability, fault tolerance (data is replicated across nodes and failed tasks are automatically reassigned), data locality, transparency, and coordination and communication.
Because distributed processing divides workloads across multiple machines, it delivers scalability, higher performance, and fault tolerance, enabling faster computation, cost efficiency, and continuous availability.
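A single-machine analogy of the idea using Python’s multiprocessing module: the dataset is split into chunks, each worker processes one chunk in parallel, and the partial results are combined, which is the same map/reduce pattern frameworks like Spark scale across many nodes:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    return sum(chunk)            # the per-partition work

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]             # partition the workload

    with Pool(processes=4) as pool:
        partial_sums = pool.map(process_chunk, chunks)  # distribute chunks to workers

    print("total:", sum(partial_sums))                  # combine the partial results
```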