The goal of this blog post is to explore the Data Engineering (DE) Lifecycle and understand how each of its stages works. As a reference, I am using the book Fundamentals of Data Engineering by Joe Reis & Matt Housley.
What is Data Engineering?
Data Engineering is a discipline focused on designing, building, and maintaining flows that transform data from a source to a final storage destination or to the users who need it.
The Data Lifecycle and the Data Engineering Lifecycle are often confused, so we will distinguish between the two, although we will focus primarily on the engineering side.
1. The Data Lifecycle (Management Focus)
The Data Lifecycle is a separate concept. To define it, we refer to the DAMA-DMBOK (Data Management Body of Knowledge). Here, the data lifecycle is defined similarly to that of a product: it is born, it is used, and eventually, it dies.
- Planning: Defining requirements and architectures.
- Enablement / Design: Creating the systems and tools needed.
- Creation / Acquisition: The entry point of the data.
- Maintenance and Storage: Processing and ensuring persistence.
- Usage: Providing real value to the business.
- Enhancement: Quality techniques and enrichment.
- Archiving: Keeping data for legal or historical reasons.
- Purging / Deletion: Secure deletion once utility ends.
2. The Data Engineering Lifecycle
This cycle focuses on the "pipeline": the technical stages that turn raw data into valuable resources. It is divided into five main stages:
- Generation
- Ingestion
- Storage
- Transformation
- Serving
In addition to these, there are undercurrents (cross-cutting concepts) that intervene in every stage: Security, Data Management, DataOps, Data Architecture, Orchestration, and Software Engineering.
Deep Dive into the Stages
Generation
The origin of everything. This is where data is born.
- Knowledge vs. Control: We aren't always the owners of the source (very common!), but we must know how data is generated, its frequency, and its velocity.
- Communication: Talking to source owners helps anticipate changes. Every source has unique limits.
- Schema Evolution: The more a schema changes, the more challenging it becomes to keep the pipeline stable.
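One defensive habit against schema evolution is to validate incoming records against the schema the pipeline expects, so upstream changes surface as alerts instead of silent breakage. A minimal sketch (all field names here are illustrative, not from the book):

```python
# Minimal sketch: detect schema drift in incoming records before it
# breaks downstream steps. Field names are purely illustrative.

EXPECTED_SCHEMA = {"order_id", "customer_id", "amount", "created_at"}

def check_schema(record: dict) -> tuple[set, set]:
    """Return (missing, unexpected) fields for one incoming record."""
    fields = set(record.keys())
    missing = EXPECTED_SCHEMA - fields
    unexpected = fields - EXPECTED_SCHEMA
    return missing, unexpected

# A record where the source dropped one field and added another:
record = {"order_id": 1, "customer_id": 7, "amount": 9.5, "channel": "web"}
missing, unexpected = check_schema(record)
# missing -> {"created_at"}, unexpected -> {"channel"}
```

In a real pipeline this check would feed a monitoring alert, giving you time to talk to the source owners before the change propagates.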
Storage
Choosing where to save data defines your freedom to operate. In the cloud, cost depends on the "storage temperature":
- Hot Storage: Constant queries (low latency).
- Cold Storage: Sporadic queries (e.g., once a month).
- Archive Storage: "Frozen" historical data (high latency).
Tip: While storage seems cheap, as a company scales, costs can become a headache if you don't choose wisely from the start.
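To make the tip concrete, here is a back-of-the-envelope comparison of the three temperatures. The per-GB prices below are made-up placeholders, not real cloud pricing, which varies by provider and region and adds retrieval fees (these dominate for archive tiers):

```python
# Hypothetical per-GB-month prices for illustration only.
PRICE_PER_GB_MONTH = {"hot": 0.023, "cold": 0.004, "archive": 0.001}

def monthly_cost(gb: float, tier: str) -> float:
    """Storage cost per month, ignoring retrieval and request fees."""
    return gb * PRICE_PER_GB_MONTH[tier]

# At 100 TB, the gap between tiers is already thousands of dollars a month:
for tier in PRICE_PER_GB_MONTH:
    print(f"{tier:>7}: ${monthly_cost(100_000, tier):,.2f} / month for 100 TB")
```

The spread between tiers is exactly why choosing the wrong temperature early becomes a headache at scale.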
Ingestion
This is where bottlenecks often appear. You must decide your strategy based on the use case:
- Batch: Defined intervals (e.g., reading a DB or a CSV once a day).
- Streaming (Real-time): Continuous flow (e.g., sensors or app events).
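The two strategies above can be sketched side by side. This is a simplified illustration (the file contents and event shapes are invented); in production the streaming side would be a Kafka or Kinesis consumer loop rather than a Python generator:

```python
import csv
import io

# Batch: read a whole extract at a defined interval (e.g., a daily CSV).
def ingest_batch(csv_text: str) -> list[dict]:
    return list(csv.DictReader(io.StringIO(csv_text)))

# Streaming: handle events one by one as they arrive.
def ingest_stream(events):
    for event in events:  # in practice: a message-broker consumer loop
        yield {"processed": True, **event}

rows = ingest_batch("id,amount\n1,10\n2,20\n")
processed = list(ingest_stream([{"sensor": "s1", "value": 3}]))
```

The key difference is the unit of work: a batch job sees a bounded dataset, while a streaming job sees an unbounded sequence of individual events.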
Transformation
We move from "dirty" data to data that follows business rules. Data Wrangling and cleaning are what guarantee the end user receives value and not "garbage."
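A tiny cleaning step makes the idea tangible. Assume an invented business rule ("amount must be a non-negative number") and illustrative field names:

```python
# Minimal cleaning sketch: drop incomplete rows and normalize fields so
# the output follows a business rule. Field names are illustrative.

def clean(records: list[dict]) -> list[dict]:
    out = []
    for r in records:
        if r.get("amount") is None:
            continue                  # drop rows missing required data
        amount = float(r["amount"])
        if amount < 0:
            continue                  # enforce the business rule
        out.append({"name": r.get("name", "").strip().lower(),
                    "amount": amount})
    return out

raw = [{"name": "  Ana ", "amount": "10.5"},
       {"name": "Luis", "amount": None},
       {"name": "Eva", "amount": "-3"}]
# clean(raw) keeps only the first record, with the name normalized
```

Real transformations usually run in SQL or a framework like Spark or dbt, but the principle is the same: "dirty" in, rule-abiding data out.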
Serving
This is where we finally see the value of our work. Data is exposed for:
- Analytics (BI)
- Machine Learning
- Reverse ETL
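As a sketch of the Reverse ETL case: a value computed in the warehouse is reshaped into whatever payload a downstream operational tool expects. The payload format and field names below are assumptions for illustration, not a real CRM API:

```python
# Illustrative Reverse ETL step: shape a warehouse aggregate into the
# payload a downstream tool (e.g., a CRM) expects. Not a real API.

warehouse_row = {"customer_id": 42, "lifetime_value": 1250.0, "segment": "gold"}

def to_crm_payload(row: dict) -> dict:
    return {
        "external_id": str(row["customer_id"]),
        "properties": {"ltv": row["lifetime_value"], "tier": row["segment"]},
    }

payload = to_crm_payload(warehouse_row)
# the serving job would then POST this payload to the CRM's API
```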
Undercurrents
To keep the cycle from breaking, we need these foundations:
- Security: Access only for those who need it. Additionally, a data engineer must keep in mind that they are responsible for the data they access and must ensure its security throughout their developments.
- Data Management: Ensuring that data is understandable, maintains high quality, and respects privacy.
- DataOps: Applying DevOps culture to data. Its goal is to automate, monitor, and detect errors quickly.
- Architecture: The architecture team is responsible for understanding the business holistically in order to make strategic decisions. Unlike an engineer, who may focus on a limited number of projects, architects must know the "full map" and balance business needs against the right technological solution. Architecture defines how components communicate and ensures they work not just for now, but for future developments.
- Orchestration: The process of coordinating multiple jobs to run quickly and efficiently, either on a defined schedule or as needed. The orchestrator manages dependencies and monitors executions. A good design must account for the fact that processes can fail, defining automatic retry rules to avoid constant human intervention.
- Software Engineering: Software engineering and data engineering share a lot of common ground: streaming, infrastructure as code (IaC), pipelines as code, general development practices, and the use of open-source frameworks.
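The retry rule mentioned under Orchestration can be sketched by hand. Orchestrators like Airflow expose this as task-level settings; the helper and job names below are invented for illustration:

```python
# Sketch of an automatic retry rule: re-run a failing job a few times
# with exponential backoff before escalating to a human.
import time

def run_with_retries(job, max_attempts: int = 3, base_delay: float = 1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise                 # give up: alert / human intervention
            time.sleep(base_delay * 2 ** (attempt - 1))  # 1s, 2s, 4s, ...

# A job that fails twice with a transient error, then succeeds:
calls = {"n": 0}
def flaky_job():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = run_with_retries(flaky_job, base_delay=0.01)
```

Designing for failure this way is what keeps a pipeline running overnight without someone being paged for every transient error.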
Which of these 5 stages has given you the most headaches lately? I’ll read your thoughts in the comments!