DEV Community

Fady GA 😎

Data Engineering Lifecycle - Basics - Theory

This is the first article in my Data Analytics Made Simpler series. To learn more about it, please visit its website, where I explain what exactly this series is, what motivated me to write it, and how I plan to release its content!

I wanted to start my series by drawing a mental image of what it will cover. In this article, I'll explain what I mean by "data analytics" in the series title, describe what exactly Data Engineering is, and walk through its general components using the Data Engineering Lifecycle framework.

For the same article but in Arabic, please click here

What I mean by Data Analytics:

The term Data Analytics is very broad and can have different meanings for different people! So, I want to be clear about it.

When I talk about Data Analytics, I picture the following diagram in my head.

Data Analytics

In my head, a huge chunk of Data Analytics is Data Engineering, which I'll explain shortly, regardless of the data's volume, variety, or velocity (understanding those terms will become easier over time, trust me 😉).
The remaining chunk is Data Visualization. To me, it's the most natural thing to do with data after we have processed it in the Data Engineering phase, but it isn't the only thing we can do.
By no means is this an official definition of Data Analytics, but it is how I'll be tackling it in my series.

What is Data Engineering?

The overly simplified version of the Data Engineering definition is as follows:

Data Engineering is the data field concerned with moving data from source systems to destination systems while applying transformations to the data according to the business requirements.

For example, imagine an enterprise with multiple applications. Because they were built by different development teams, each application generates its logs in its own format, although the logs all contain the same information. Now, management wants a dashboard that shows user access trends over the past 3 months, aggregated across all the applications.
The operation of collecting the logs from each application, unifying their format, extracting the user access data, aggregating it, and finally loading the aggregated data into a data warehouse (more on that later in the series) that the dashboard will eventually use is called, you've guessed it, Data Engineering!
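
To make the example a bit more concrete, here is a minimal Python sketch of that flow. Everything in it is an assumption for illustration: the two log formats, the sample lines, the function names, and the plain dictionary standing in for the data warehouse are all made up, and a real pipeline would read from actual log files and write to an actual warehouse.

```python
from collections import Counter
from datetime import date

# Hypothetical raw log lines from two applications that record the same
# information (user and access date) in different formats.
APP_A_LOGS = [
    "2024-01-05|alice|login",
    "2024-01-17|bob|login",
    "2024-02-02|alice|login",
]
APP_B_LOGS = [
    "user=carol date=05/01/2024 action=login",
    "user=alice date=09/02/2024 action=login",
]


def parse_app_a(line):
    """Parse 'YYYY-MM-DD|user|action' into a unified record."""
    day, user, _action = line.split("|")
    return {"user": user, "day": date.fromisoformat(day)}


def parse_app_b(line):
    """Parse 'user=... date=DD/MM/YYYY action=...' into a unified record."""
    fields = dict(part.split("=", 1) for part in line.split())
    dd, mm, yyyy = fields["date"].split("/")
    return {"user": fields["user"], "day": date(int(yyyy), int(mm), int(dd))}


def extract():
    """Collect the logs from every application in one unified format."""
    records = [parse_app_a(line) for line in APP_A_LOGS]
    records += [parse_app_b(line) for line in APP_B_LOGS]
    return records


def transform(records):
    """Aggregate user accesses per month across all applications."""
    return Counter(r["day"].strftime("%Y-%m") for r in records)


def load(monthly_counts, warehouse):
    """Write the aggregated rows to the destination (a dict stands in here)."""
    warehouse["monthly_user_access"] = dict(sorted(monthly_counts.items()))


warehouse = {}
load(transform(extract()), warehouse)
print(warehouse["monthly_user_access"])
```

Each application gets its own small parser, but everything after that point works on one unified record shape, which is the part that makes the aggregation simple.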

Now that you have some idea of what Data Engineering is, I should add to the previous definition that Data Engineering isn't just the data movement and the applied transformations. It's also the design, implementation, and maintenance of the whole automated system (called a pipeline) that performs the whole thing!

You may now get the impression that Data Engineering is an enterprise thing! You won't be wrong! But you won't be entirely correct either 😁. Think about it: say you have a bunch of Excel workbooks lying around on your laptop with monthly sales for the past 15 years, each month in its own workbook, and you've written a Python script that opens them, gets each monthly sales total, and finally loads all of those into an aggregated report. According to the definition above, you've kind of done Data Engineering!
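
A script like that could look roughly like the sketch below. To keep it runnable with only the standard library, I'm assuming each month has been exported to a CSV file with an `amount` column; reading real `.xlsx` workbooks would need a third-party library such as openpyxl or pandas. The file names, the column name, and the sample numbers are all made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def monthly_total(path):
    """Sum the 'amount' column of one monthly sales file."""
    with open(path, newline="") as f:
        return sum(float(row["amount"]) for row in csv.DictReader(f))


def build_report(folder):
    """Return {month: total} for every monthly file in the folder."""
    return {p.stem: monthly_total(p) for p in sorted(Path(folder).glob("*.csv"))}


# Create two tiny sample files so the sketch runs on its own.
with tempfile.TemporaryDirectory() as folder:
    samples = {"2024-01": [100.0, 250.5], "2024-02": [80.0, 20.0]}
    for month, amounts in samples.items():
        with open(Path(folder) / f"{month}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["amount"])
            writer.writerows([a] for a in amounts)
    report = build_report(folder)

print(report)
```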

The Data Engineering Lifecycle:

I'll start this section by showing it to you then I'll explain:
de-lifecycle

So, if we talk about Data Engineering while leaving out the technical details and the choice of tools, we get a "framework" that tries to explain how it works, i.e., the Data Engineering Lifecycle!

If you take a closer look at the previous figure, you will be able to verify the earlier Data Engineering definition! You will see that we are extracting or "ingesting" data from the sources that "generate" it, applying "transformations" to it, and finally loading or "serving" it to destination systems, which will generally be either analytics or machine learning workloads.
You may have noticed that there is a "storage" block that spans the "ingestion", "transformation", and "serving" blocks. This is because we can actually store data at any stage of the lifecycle. For example, a common data lake architecture (we will learn about data lakes later in the series) is to store the raw data before processing it. Then, we apply some filtering and cleaning transformations and store the resulting data in a dedicated location. And finally, we run some aggregations on the processed data and store those results in a dedicated location as well.

This is called the Medallion Architecture as the raw area is called the Bronze stage, the processed area is called the Silver stage, and the aggregated area is called the Gold stage.
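
Here is a tiny Python sketch of those three stages. It is only an illustration under my own assumptions: the sample records, the cleaning rules, and the function names are made up, and plain Python lists and dictionaries stand in for the dedicated storage locations of a real data lake.

```python
# Hypothetical raw records, stored exactly as they arrived (Bronze).
bronze = [
    {"user": "alice", "amount": "10.5"},
    {"user": "bob", "amount": "not-a-number"},
    {"user": "alice", "amount": "4.5"},
    {"user": "", "amount": "3.0"},
]


def to_silver(rows):
    """Filter and clean: drop rows with no user or a non-numeric amount."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        if row["user"]:
            cleaned.append({"user": row["user"], "amount": amount})
    return cleaned


def to_gold(rows):
    """Aggregate: total amount per user."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals


silver = to_silver(bronze)  # processed area
gold = to_gold(silver)      # aggregated area
print(gold)
```

Note that `bronze` is never modified, so the raw data stays available if the cleaning rules change later.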

One other thing to know about the Data Engineering Lifecycle is that it isn't a strictly sequential flow! This means that you can (and often will) run its stages out of the illustrated order and/or more than once. For example, we can serve the ingested data directly to some "data consumer" systems before any transformation, then perform the transformations and serve the final output to other systems.
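
As a rough sketch of that last example, assume three hypothetical functions, `ingest`, `transform`, and `serve`, with plain lists standing in for the consumer systems. None of these names come from a real library; they are only here to show the serving step running twice.

```python
def ingest():
    """Hypothetical ingestion step returning raw events."""
    return [
        {"user": "alice", "ms": 120},
        {"user": "bob", "ms": 80},
        {"user": "alice", "ms": 100},
    ]


def transform(events):
    """Average response time per user."""
    grouped = {}
    for event in events:
        grouped.setdefault(event["user"], []).append(event["ms"])
    return {user: sum(values) / len(values) for user, values in grouped.items()}


def serve(data, consumer):
    """Hand data to a consumer; lists stand in for real destination systems."""
    consumer.append(data)


raw_consumer, dashboard = [], []
events = ingest()
serve(events, raw_consumer)          # served before any transformation
serve(transform(events), dashboard)  # served again after transformation
```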

What a Data Engineer is NOT:

There are a lot of data-related roles out there with blurry boundaries between them. In my opinion, this could be due to the rapid evolution of how we (or our systems) produce and consume data, which adds scope to existing data roles or creates entirely new ones.
I want to clear up the confusion about what a Data Engineer is not, so that we can better understand what a Data Engineer actually is!

  • A Data Engineer should be familiar with statistics if their pipeline requires statistical transformations, but it's the role of a "Data Analyst" to make sense of the data and use statistical analysis to better understand it and uncover hidden insights!
  • A Data Engineer might be familiar with statistical modelling and machine learning algorithms, but it is the job of a "Data Scientist" to create models from data for machine learning use cases.
  • A Data Engineer should know how to handle production workloads, but it is the job of a "Machine Learning Engineer" to deploy production machine learning systems.
  • A Data Engineer works a lot with databases, but it's the role of a "Database Administrator" to handle the administration side of a database, such as user privileges, DB maintenance, backups, and so on.
  • A Data Engineer could know a bit of Software Engineering (in fact, it will be very beneficial if they do!), but it is the job of a "Software Engineer" to create apps, even data-related ones.

As we've seen in the Data Engineering definition, a Data Engineer should focus on designing, building, and maintaining data pipelines that work within the Data Engineering Lifecycle framework, collaborating with roles like DB Admins and Software Engineers to serve other data roles like Data Analysts and Data Scientists!

Conclusion:

Data Engineering is a relatively new data role and, in my opinion, is still evolving. Here, I just wanted to give you a taste of what we will be handling in the Data Analytics Made Simpler series and to help you identify the scope of a Data Engineer.

References:

I've used the same approach to explaining Data Engineering as the book "Fundamentals of Data Engineering" by Joe Reis and Matt Housley, and I've explained the Data Engineering Lifecycle as they did in their book. It's a highly recommended read that mostly doesn't require prior knowledge of Data Engineering.
