What is Data Engineering?
The best definition of data roles I have seen states:
“If data is to be equated to a car, a data scientist is the driver and a data engineer builds the road”
This is quite accurate: while data scientists focus on extracting insights from data and using it to make decisions, data engineers prepare that data and make it available for analysis. Data engineers design and build the systems and pipelines for collecting, organizing and storing data.
This process often involves moving data from raw sources and data lakes (databases, APIs, etc.), where data is unorganized and virtually unusable, into organized storage facilities called data warehouses. In a data warehouse, the data is transformed so that it starts to make sense and becomes usable for analysis.
Extract, Transform, Load (ETL)
As mentioned earlier, a data engineer’s job will mainly involve moving data from lakes to data warehouses. This takes place in stages described as ETL.
What is ETL? I’m sure you have a feeling I’m about to tell you.
E. Extract
This involves collecting and aggregating data from various sources: APIs, databases, and data collection systems like Enterprise Resource Planning (ERP) software, CRMs, etc. At this point the data is mostly unstructured, unorganized and widely varying, and it can't be useful in this form.
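As a rough illustration, here is a minimal extract step in Python. The endpoint URL, the `results` response shape and the field names are all hypothetical; `requests` stands in for whatever client your actual source needs:

```python
import requests

def extract_orders(api_url: str) -> list[dict]:
    """Pull raw order records from a (hypothetical) REST API."""
    response = requests.get(api_url, params={"page_size": 100}, timeout=30)
    response.raise_for_status()  # fail loudly if the source is down
    return response.json()["results"]  # assumed response shape

# raw, unorganized records straight from the source
raw_orders = extract_orders("https://api.example.com/v1/orders")
```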
T. Transform
In this step, after the data has been aggregated from its various sources, it's time to make it make sense. Data is cleaned: irrelevant data is removed, missing data is dealt with, data from different sources is combined to make it more relevant, and inconsistencies in data types are resolved. In general, the data is made ready for analysis. This is done through scripts written in programming languages like Python and SQL (more on this later).
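A sketch of what that cleaning might look like with pandas; the column names here are invented for illustration:

```python
import pandas as pd

# pretend raw_orders came from the extract step above
raw = pd.DataFrame(raw_orders)

clean = (
    raw
    .drop(columns=["debug_info"], errors="ignore")  # remove irrelevant data
    .dropna(subset=["order_id"])                    # deal with missing keys
    .assign(
        amount=lambda df: pd.to_numeric(df["amount"], errors="coerce"),  # fix type inconsistencies
        order_date=lambda df: pd.to_datetime(df["order_date"]),
    )
    .drop_duplicates(subset=["order_id"])           # combine cleanly, no repeats
)
```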
L. Load
Here is where the transformed data is stored in a data warehouse, well organized and ready for analysis. Analysts and data scientists can take over from here. Commonly used data warehouses include Snowflake, BigQuery and Redshift, among others.
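To close the loop, a minimal load step. SQLite stands in here for a real warehouse like Snowflake or BigQuery, which you would normally reach through its own connector:

```python
import sqlite3

# SQLite as a stand-in for a cloud warehouse
conn = sqlite3.connect("warehouse.db")
clean.to_sql("orders", conn, if_exists="replace", index=False)
conn.close()
```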
ELT (Modern ETL Variant)
Modern advancements in data engineering have produced ways to perform transformation operations on data that is already in the warehouse. With the development of tools like dbt, Matillion and many others, data can be transformed directly inside the warehouse. This is more efficient since it makes use of the processing power of modern cloud databases to perform the transformations.
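At its core this is just SQL executed inside the warehouse, which is the idea tools like dbt manage at scale. A minimal sketch, again with SQLite standing in for the warehouse and a made-up raw table:

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")

# ELT: the raw data is already loaded; transform it in place with SQL
conn.executescript("""
    DROP TABLE IF EXISTS orders_daily;
    CREATE TABLE orders_daily AS
    SELECT date(order_date) AS day,
           COUNT(*)         AS num_orders,
           SUM(amount)      AS revenue
    FROM orders
    WHERE amount IS NOT NULL
    GROUP BY date(order_date);
""")
conn.commit()
conn.close()
```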
Data Modelling (briefly)
Data models are conceptual representations of data. They define how data is organized, stored, accessed and updated. These models can be simple or complex. Simple data models have minimal structure and complexity, cannot represent complex relationships, and are used for basic, straightforward data storage.
Complex data models, on the other hand, have intricate structures. They model complex scenarios and relationships, and they require more effort and expertise to design and implement. Data models are presented using Entity Relationship Diagrams, control diagrams and other data flow diagrams.
The purpose of modeling data is to ensure the consistency and integrity of all the data that goes through the pipeline. A lot of data is processed through the same structures, and the same output must be achieved every time. Having a fixed way to organize this data ensures that consistency.
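As a toy example of a simple model, here is a small star-schema layout expressed as SQL DDL (the table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.executescript("""
    -- one dimension table describing customers
    CREATE TABLE IF NOT EXISTS dim_customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        country     TEXT
    );
    -- a fact table referencing the dimension, enforcing integrity
    CREATE TABLE IF NOT EXISTS fact_order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES dim_customer(customer_id),
        order_date  TEXT NOT NULL,
        amount      REAL NOT NULL
    );
""")
conn.close()
```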
How To Do All This?
Excited? Good. Let's see how to get into the fascinating world of data engineering.
Breathe In.
Not the start you expected, I know, but you must understand why good data engineers are in demand and paid the way they are: they do important work by learning what everyone else doesn't want to learn. Is it easy? If it were, everyone would be doing it. Also note that you will not understand it and get good at it just by reading a book or watching a video; you have to build stuff yourself.
Salary: Data Engineer in Nairobi, Kenya 2023
Pick a Language.
Data engineers use programming languages to achieve the things mentioned above. Languages that can be used include Java, Scala and Python. Python is used extensively because it is easy to learn, read and write, and it has good support in terms of developer community and libraries for various data-related tasks. Apart from programming languages, data engineers work with databases, which necessitates mastery of SQL (Structured Query Language).
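To give a flavour of how the two meet in practice, a tiny sketch of Python driving a SQL query (the table and column names are the invented ones from earlier):

```python
import sqlite3

conn = sqlite3.connect("warehouse.db")
# Python handles the flow control; SQL does the querying
top = conn.execute(
    "SELECT customer_id, SUM(amount) AS total "
    "FROM fact_order GROUP BY customer_id "
    "ORDER BY total DESC LIMIT ?",
    (5,),
).fetchall()
for customer_id, total in top:
    print(customer_id, total)
conn.close()
```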
Another important tool to know at this point is the command line. Mastering the CLI gives you complete control of your workflow, and you can easily monitor operations in real time. Learning these things can take time to get to a level of mastery, but consistency is the rule of the game.
Master Basic Computer Science.
Hear me out. Code will be involved. Resources will be utilized, so there will be a need to optimize. Having an understanding of and appreciation for data structures and algorithms will go a long way towards ensuring your code does what it is meant to do while being as efficient as possible. Those AWS instances are quite expensive.
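A small illustration of why this matters: deduplicating records with a set is linear time, while the naive list approach is quadratic and hurts at pipeline scale:

```python
# O(n^2): membership test scans the whole list each time
def dedupe_slow(ids):
    seen = []
    for i in ids:
        if i not in seen:
            seen.append(i)
    return seen

# O(n): a set gives constant-time membership checks
def dedupe_fast(ids):
    seen = set()
    out = []
    for i in ids:
        if i not in seen:
            seen.add(i)
            out.append(i)
    return out
```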
Understand ETL and ELT.
These concepts were discussed in a previous section. This is how data is moved from sources to data warehouses. There are many ways and tools that help with this process; every once in a while, a new 'improved' tool is released. SQLpipe, Airbyte, Fivetran and Estuary are just a few. You can also write custom scripts for some of your work, but this will require good mastery of ETL processes and programming.
While at it, get a good understanding of how data warehouses work: how data is organized in these warehouses and how to apply tools like dbt, Informatica, etc. to transform data in the warehouses.
Data Orchestration
Data pipelines have many tasks and operations that have to work in perfect sync so that data can flow smoothly. Data orchestration defines, schedules and monitors pipeline operations, making their management easier. For simple pipelines, a custom script should suffice, but when the pipeline gets larger and more complex and the operations become many and more sensitive, it is okay to get help. Luckily, there are systems developed for this job; Apache Airflow and Prefect are useful examples.
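For a feel of what orchestration looks like, here is a minimal Apache Airflow DAG (Airflow 2.x style) that runs an extract task, then a transform task, every day. The task bodies are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data...")  # placeholder for a real extract step

def transform():
    print("cleaning data...")     # placeholder for a real transform step

with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task  # transform runs only after extract succeeds
```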
The Cloud.
The cloud is just someone else's computer that you can access from your own computer through the internet. Data pipelines may work well on your computer, but you won't hand your computer over to the data scientists. Some operations require more resources than your computer can offer, so using resources on cloud platforms like Microsoft Azure, Google Cloud Platform and Amazon Web Services (AWS) is very important.
Finally
Finally, as I said in the beginning, this article won't make you a data engineer. It is imperative that you get some hands-on activity going. Build stuff that will sharpen your skills and also serve as proof that you can ship good data engineering pipelines and solutions. Only then can we say the Data Engineer title has been earned. It's hard, I know, but a data engineer I know always says:
“Six months of dedicated focus and unwavering consistency can put you five years ahead.”
Give it a shot, see where it goes. The learning never stops. Adios!!