
A Typical Data Pipeline

Introduction

Hello people, hope you are doing well.

As data engineers, we build data pipelines to collect data from different source systems and place it in an analytics system, i.e., a data warehouse/data lake. The data is usually sourced from systems like a database, web events, or an API.

In this post, we will talk about

  1. What is a data pipeline?
  2. Stages of a data pipeline
  3. What is ETL and ELT?
  4. What is the difference between a data warehouse and a data lake?
  5. Why do we need a data pipeline?

1. What is a data pipeline?

We know that a data pipeline collects data from different source systems and moves it to the analytics system. Let's refine that definition a bit:

A data pipeline is a series of interconnected systems that passes data in only one direction, i.e., from source to serving layer, with increasing clarity and value in the data.

The data received from a source may contain duplicate records, test records, and problematic records. The data pipeline should be designed in such a way that it eliminates all these issues, and only then is the data moved from raw to staging to serving, thus increasing its clarity and value. A minimal sketch of this idea follows.
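Here is a minimal sketch, assuming a pandas-based pipeline and illustrative column names like `is_test` (these are not from any specific system), of how duplicate and test records can be filtered out as data moves from the raw layer to the staging layer:

```python
import pandas as pd

# Raw layer: data exactly as it arrived from the source, duplicates and all.
raw = pd.DataFrame([
    {"order_id": 1, "customer": "alice",  "amount": 25.0, "is_test": False},
    {"order_id": 1, "customer": "alice",  "amount": 25.0, "is_test": False},  # duplicate
    {"order_id": 2, "customer": "qa-bot", "amount": 0.0,  "is_test": True},   # test record
    {"order_id": 3, "customer": "bob",    "amount": 40.0, "is_test": False},
])

# Staging layer: drop duplicates and test records so downstream consumers
# only ever see clean data.
staging = (
    raw.drop_duplicates(subset="order_id")
       .loc[lambda df: ~df["is_test"]]
       .reset_index(drop=True)
)

print(staging)
```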

2. Stages of a data pipeline

A typical data pipeline will have the stages below.

  1. Sources
  2. Raw
  3. Staging
  4. Serving

The below picture shows the different stages of a data pipeline

[Figure: stages of a data pipeline — Sources → Raw → Staging → Serving]

  1. Data Sources : Where data is generated, recorded, or obtained for the data pipeline. For example, a database, an SFTP file server, or an API.

  2. Raw layer : This is the first level of data storage in the data warehouse. It is an archival storage layer, and the data stored in it is never modified.

  3. Staging/Transform layer : This is the second level of data storage in the data warehouse. The data in this layer, sourced from the raw layer, is cleaned, transformed into a defined format, and then stored in various staging tables.

  4. Serving layer : This is the aggregated data layer, and its data is sourced from the different staging tables. For example, the average amount spent by customers over the years (see the sketch after this list).
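As a sketch of the serving layer, the aggregate below computes the average amount spent per customer per year from a staging table. The table and column names are purely illustrative:

```python
import pandas as pd

# Staging table: one cleaned row per order.
staging_orders = pd.DataFrame([
    {"customer": "alice", "year": 2021, "amount": 25.0},
    {"customer": "alice", "year": 2022, "amount": 60.0},
    {"customer": "bob",   "year": 2021, "amount": 40.0},
    {"customer": "bob",   "year": 2022, "amount": 35.0},
])

# Serving table: average amount spent by each customer per year,
# ready for dashboards and reports.
serving_avg_spend = (
    staging_orders
    .groupby(["customer", "year"], as_index=False)["amount"]
    .mean()
    .rename(columns={"amount": "avg_amount"})
)

print(serving_avg_spend)
```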

3. What are ETL and ELT?

ETL stands for Extract, Transform, and Load; ELT stands for Extract, Load, and Transform. Both are processes to get data into an analytical system like a data warehouse or data lake, preferably on a schedule.

ETL

In ETL, the data is extracted from a source, transformed into its final format, and loaded into a data warehouse table. This is well suited when the data is in a database and needs minimal changes.
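Here is a minimal ETL sketch, using SQLite to stand in for both the source database and the warehouse. All table and column names are assumptions for illustration:

```python
import sqlite3
import pandas as pd

# Extract: read from the source system.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, "alice", 25.0), (2, "bob", 40.0)])
source.commit()
orders = pd.read_sql("SELECT * FROM orders", source)

# Transform: shape the data into its final, analysis-ready form
# before it ever touches the warehouse.
orders["amount_cents"] = (orders["amount"] * 100).astype(int)

# Load: write the transformed data into the warehouse table.
warehouse = sqlite3.connect(":memory:")
orders.to_sql("fact_orders", warehouse, index=False)
```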

ELT

In ELT, the data is extracted from a source and loaded into the raw layer as-is. When the transformations are defined later, the data is transformed and loaded into the final tables in the data warehouse.

ELT is well suited when the data comes from different sources and the transformations for all of it are not yet fully defined. The data is transformed once the transformations become available, and the corresponding staging tables are updated accordingly.
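Here is a minimal ELT sketch, again with SQLite standing in for the warehouse: the raw data is landed first, and the transformation runs later as SQL inside the warehouse. Table and column names are illustrative:

```python
import sqlite3
import pandas as pd

warehouse = sqlite3.connect(":memory:")

# Extract + Load: land the source data in the raw layer untouched.
raw_events = pd.DataFrame([
    {"event_id": 1, "user": "alice", "event_type": "click"},
    {"event_id": 2, "user": "bob",   "event_type": "purchase"},
])
raw_events.to_sql("raw_events", warehouse, index=False)

# Transform (defined later): build a staging table from the raw layer with SQL.
warehouse.execute("""
    CREATE TABLE stg_purchases AS
    SELECT event_id, user
    FROM raw_events
    WHERE event_type = 'purchase'
""")
print(warehouse.execute("SELECT * FROM stg_purchases").fetchall())
```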

4. What is the difference between a data warehouse and a data lake?

| Data Warehouse | Data Lake |
| --- | --- |
| Stores processed data | Stores raw data |
| Transformations are fully defined | Transformations are not fully defined |
| Stores structured data in tabular format in data warehouse tables | Stores structured, semi-structured, and unstructured data in raw form |
| More complicated and costly to make changes to the tables | Highly accessible and quick to update |
| File formats like Parquet, ORC, and Delta are used | File formats like CSV, text files, Parquet files, PDFs, etc. are used |

5. Why do we need a data pipeline?

Large volumes of data come in from different sources (apps, web events, transactions, images/videos, telemetry data). This data comes in different types and sizes.

These characteristics, i.e., volume, variety, and velocity, are the attributes of Big Data. Processing big data in a consistent manner while serving it to an entire organization is a huge challenge. Data engineers build data pipelines to solve this problem.

Once the data is processed, it is stored in a centralized data repository like a data warehouse. This ensures that the data is not siloed and that anyone with the right access can always reach the data and perform their analysis.

Data pipelines can also ensure data quality at scale. Any checks that need to be performed on the data can be applied in the transformation stage, ensuring quality, as shown below.
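A minimal sketch of such checks, with rules that are purely illustrative assumptions:

```python
import pandas as pd

def check_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Fail the pipeline run if the batch violates basic expectations."""
    assert df["order_id"].notna().all(), "order_id must never be null"
    assert df["order_id"].is_unique, "order_id must be unique"
    assert (df["amount"] >= 0).all(), "amount must be non-negative"
    return df

batch = pd.DataFrame([
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": 40.0},
])
clean_batch = check_quality(batch)  # raises AssertionError on bad data
```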

Since we also retain the raw data, pipelines can be made resilient to failures; any data loss or corruption downstream can be fixed by reprocessing the existing raw data.

Check out the video on the same topic:

Resources

  1. Data Lake vs Data Warehouse: Key Differences | Talend
  2. Databases Vs. Data Warehouses Vs. Data Lakes | MongoDB
