
Bartosz Gajda

Posted on • Originally published at bartoszgajda.com

5 Essential skills for becoming a Data Engineer

Data Engineer skills are a hot topic for everyone interested in becoming one. The rise of data platforms wouldn't be possible without data engineers, who develop the underlying infrastructure and tools. The need for specialists in this area is forecast to only increase, so if you are considering becoming a Data Engineer, these are the 5 essential skills you have to possess. Enjoy!

Programming - Scala and/or Python

The first and most important skill - programming. The job of a Data Engineer requires writing a significant amount of custom, tailored code, which cannot be generated by fancy GUI software. The ability to write readable, clean and scalable code is a must.

I have chosen two programming languages here: Scala and Python. I am not going to lie - Scala is my personal favorite (if you haven't guessed already from the number of posts on this topic :)), although Python is also a good choice.

Some of the Data Engineering frameworks are written in Scala (e.g. Apache Spark), and it works great when you need more control over the code.

Python is probably much easier to start with and generally more newbie-friendly, although it has some performance issues, which in the data world might be critical.

A good option is to learn one of those to a proficient level, and the other one at least to the basics - it is always better to have another tool under your belt, even though it might not be as good as the others.
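To give you a taste of the language, here is a minimal Scala sketch (no frameworks, all names made up) that parses some raw rows and aggregates them using immutable collections - the kind of clean, readable code this job is all about:

```scala
// A minimal, self-contained Scala sketch (Scala 2.13+): clean raw rows
// and aggregate revenue per product. All names here are made up.
object RevenueExample extends App {
  // Raw CSV-like rows; in a real job these would come from a file or stream
  val rawRows = List("book,12.99", "book,7.50", "pen,1.20", "bad_row")

  // Parse defensively: drop rows that don't split into exactly two fields
  val parsed: List[(String, Double)] = rawRows.flatMap { row =>
    row.split(",") match {
      case Array(product, price) => Some(product -> price.toDouble)
      case _                     => None // skip malformed rows
    }
  }

  // Group by product and sum the prices in one pass
  val revenuePerProduct: Map[String, Double] =
    parsed.groupMapReduce(_._1)(_._2)(_ + _)

  println(revenuePerProduct) // Map(book -> 20.49, pen -> 1.2)
}
```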

If you are looking to learn Scala, be sure to check out my post that explains where to start from.

Data Manipulation - SQL

Despite the flood of semi-structured and unstructured data in the data world, SQL is still the de facto data manipulation language. Whether you are working with a relational database, a data warehouse or a processing engine like Spark - SQL is widely supported.

Some technologies use their own variations of SQL, like CQL in the Cassandra distributed database system, and there isn't anything on the horizon that would make SQL go away.

Not all data engineering tasks are focused on building fancy Machine Learning pipelines or Stream Processing. Lots of simple problems can be solved by writing a SQL query. Modern SQL engines handle automatic optimization of those queries, which is great if high-performance programming is not your strongest skill.
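To make this concrete, here is the kind of query that covers a surprising number of everyday tasks. The sketch below runs it through Spark SQL from Scala in local mode; the table and column names are made up:

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch: a plain SQL aggregation executed through Spark SQL.
// Table and column names are made up for illustration.
object SqlExample extends App {
  val spark = SparkSession.builder()
    .appName("sql-example")
    .master("local[*]") // local mode, just for experimenting
    .getOrCreate()

  import spark.implicits._

  // Register an in-memory dataset as a temporary view
  Seq(("book", 12.99), ("book", 7.50), ("pen", 1.20))
    .toDF("product", "price")
    .createOrReplaceTempView("sales")

  // The engine's optimizer plans this query for us -
  // no hand-tuned code required
  spark.sql(
    """SELECT product, SUM(price) AS revenue
      |FROM sales
      |GROUP BY product
      |ORDER BY revenue DESC""".stripMargin
  ).show()

  spark.stop()
}
```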

Processing Engine - Apache Spark

The third element on the list is probably my favorite - the Apache Spark processing engine. As a Data Engineer you don't want to reinvent the wheel - you want a framework that is capable of solving the majority of problems in a unified and repeatable way.

The days of Hadoop's MapReduce are long gone, and Apache Spark is what you should focus on. It is the most popular general-purpose processing engine, and it can do ETL, batch and streaming processing, machine and deep learning, and much more.

The nice thing about Spark is that it supports 3 languages: Java, Scala and Python. R is also supported, to a lesser extent.
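As a taste of the DataFrame API, here is a minimal batch ETL sketch in Scala - read a CSV, filter and aggregate, write Parquet. The input and output paths and the column names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// A minimal batch-ETL sketch using the DataFrame API.
// The paths and column names are made up for illustration.
object EtlExample extends App {
  val spark = SparkSession.builder()
    .appName("etl-example")
    .master("local[*]")
    .getOrCreate()

  spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("data/events.csv")                    // hypothetical input file
    .filter(col("status") === "ok")            // drop bad events
    .withColumn("day", to_date(col("timestamp")))
    .groupBy("day")
    .count()                                   // events per day
    .write
    .mode("overwrite")
    .parquet("data/events_per_day")            // hypothetical output path

  spark.stop()
}
```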

Learning Apache Spark may seem cumbersome at the start, but once you get a good grasp of the basics, it's very pleasant to work with. However, once you move on to advanced topics like optimization, the various types of testing and deployment, the learning curve becomes quite steep.

If you are looking for a quick intro on how to get started with Spark, check out my post on it.

Cloud - AWS or GCP or Azure

I hope I don't have to convince you that the cloud is now everywhere. Data platforms and data engineering pipelines are no different - lots of them utilize cloud services.

Storing data on AWS S3, streaming data with GCP Pub/Sub or keeping data in Azure Cosmos DB - such combinations are frequently found in commercial systems. Whether you are just starting with the cloud or have some previous experience, being a data engineer means using the cloud regularly.

Out of those three, AWS is probably the safest choice. It's the most popular, has the most services, is widely known, and there are lots of tutorials that can guide you through it.
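As a small taste, here is a sketch of writing an object to S3 using the AWS SDK for Java v2, which works fine from Scala. The bucket and key names are made up, and credentials are assumed to come from the default provider chain (environment variables, profile files, etc.):

```scala
import software.amazon.awssdk.core.sync.RequestBody
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.PutObjectRequest

// A minimal sketch of uploading an object to S3 from Scala.
// Bucket and key names below are hypothetical.
object S3Example extends App {
  // Picks up region and credentials from the default provider chain
  val s3 = S3Client.create()

  val request = PutObjectRequest.builder()
    .bucket("my-data-platform-bucket")  // hypothetical bucket
    .key("raw/events/2020-01-01.json")  // hypothetical key
    .build()

  s3.putObject(request, RequestBody.fromString("""{"event": "signup"}"""))

  s3.close()
}
```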

My personal favorite, however, is the Google Cloud Platform - it's extremely developer-friendly, has an awesome selection of services and gives access to some of the most advanced AI and ML products. Google is pushing strongly into cloud territory, and for a good reason - companies like the flexibility and agility it gives to developer teams.

As for Azure, I don't have a lot of experience with it, so I leave this choice to you. I do know that they work closely with Databricks, the company founded by the creators of Apache Spark, and this might bring some interesting advances to the data engineering field.

If you are considering learning GCP and getting ACE certified, be sure to check my post on best resources to prepare for the exam.

Deployment - Docker and Kubernetes

Last but not least - deployment, and more precisely, containerization. Docker and Kubernetes are the combination that made containers a sensible alternative to the cumbersome VMs and clusters of the old days. Companies now want to ship their products fast and be ready to scale when the business gains traction, so scalable deployment is a must.

Docker will help you pack your applications into containers that can be easily scaled up or down, depending on need. Kubernetes handles the orchestration of clusters of containers - in the end, data platforms are quite complicated systems with lots of moving parts, and you want to ensure they are prepared for extra demand and resilient to failures.
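To make this concrete, here is a minimal Dockerfile sketch that packages a JVM-based (e.g. Scala) application into a container. The jar path is hypothetical - adjust the base image and paths to your project:

```dockerfile
# A minimal sketch: package an assembled Scala/JVM application
# into a container. The jar name below is hypothetical.
FROM openjdk:11-jre-slim

WORKDIR /app

# Copy the fat jar produced by e.g. sbt-assembly
COPY target/scala-2.13/my-app-assembly.jar app.jar

# Run the application; JVM flags would be tuned per deployment
ENTRYPOINT ["java", "-jar", "app.jar"]
```

From there, `docker build -t my-app .` followed by `docker run my-app` gives you a running container that Kubernetes can then schedule, scale and restart for you.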

If you are looking for some nice tips and tricks when working with Docker, be sure to check out my post on this.

Summary

I hope you have found this post useful. If so, don’t hesitate to like or share it. Additionally, you can follow me on my social media if you fancy :)
