With an extensive background in data science, analytics, and cloud computing, I am asked the same questions again and again.
Besides wanting to know the difference between a data engineer and a data scientist, one of the most common is, "What skills should I learn as a data engineer?"
It's an excellent question for new or prospective data engineers, given the opportunities available.
The fact of the matter is that companies need data engineers more than ever before. Approximately 2.5 quintillion bytes of data are created every day, a figure that continues to grow at an accelerating pace. By 2025, experts estimate that the world will create 463 exabytes of data each day. At 4.7 GB per single-layer DVD, that is the equivalent of roughly 98.5 billion DVDs per day (the frequently quoted figure of 212,765,957 DVDs corresponds to a single exabyte).
To better utilize data, companies are now realizing they need to hire data engineers to take their data from point A to point B. That way, data scientists and analysts can easily use it, increasing efficiency and productivity. That is why "data engineer" is the fastest-growing job title, according to a 2019 analysis.
To assist you as a new data engineer, I have created a skill set pyramid, which can be thought of as a hierarchy of skill set needs. It will help you focus on the skills you should learn first, so that you build a solid foundation before moving on to more specific skills. Just remember that you do not need to work through the pyramid in a strict order. You can layer the steps, revisiting earlier ones as you progress. Let's get started!
Recommended reading: 5 Mistakes New Data Engineers Make
Python and SQL
At the base of the pyramid, I recommend learning Structured Query Language (SQL) and some form of coding.
When I say coding, I mean learning the core concepts, such as loops, if statements, functions, and data structures. You need to understand what they are, what they do, and how they operate. Why would you want to use one over the other?
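As a minimal sketch of those core concepts working together, the snippet below uses a function, a loop, an if statement, and three data structures on a made-up list of events. The function name `count_events` and the sample data are hypothetical, chosen only for illustration.

```python
def count_events(events):
    """Count how often each event type appears."""
    counts = {}
    for event in events:        # loop: visit every item in the list
        if not event:           # if statement: skip empty values
            continue
        counts[event] = counts.get(event, 0) + 1
    return counts


# Hypothetical sample data, not taken from any real system.
events = ["click", "view", "click", "", "purchase", "click"]

counts = count_events(events)       # dict: look up a count by event name
unique_events = set(events) - {""}  # set: de-duplicate, test membership
# list of tuples: keeps an explicit order, here most frequent first
ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

print(counts)
print(ordered[0])
```

The choice between the three structures answers the "why use one over the other" question: a dict when you need to look something up by key, a set when you only care whether something is present, and a list when order matters.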
To become a successful data engineer, you need to be a proficient programmer. Currently, we live in the age of Python, which continues to be a standard entry point. This programming language is perfect for websites, scripting, and data. SQL is the language of data and relates to automation, scripting, and database modeling. Despite its age, it continues to play a pivotal role in managing and processing data.
SQL and Python are the technologies most often named in job listings. Whether data engineers work for Apple or a small startup, they are expected to be experts in SQL, and Python remains in high demand as well.
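To show how the two fit together, here is a small sketch that runs SQL from Python using the standard-library `sqlite3` module and an in-memory database. The `orders` table and its rows are invented for the example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

# Hypothetical sample rows; the ? placeholders keep values out of the SQL text.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 20.0)],
)

# A typical aggregation: one row per customer, largest spender first.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()
conn.close()

print(rows)
```

Python handles the orchestration (connecting, feeding in rows, collecting results) while SQL describes what to compute, which is the division of labor you will see in most pipelines.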
The best languages and technologies for you will depend on what you aim to specialize in. For example, those who are experts in data processing may be highly proficient in Spark or AWS. However, before you reach that point, you need to learn the basics.
ETL and Data Warehousing
The next level includes ETL (extract, transform, load) and ELT (extract, load, transform), the processes that allow you to take data from one point to another, typically using a tool or your own code. The data is extracted from a source, often transformed, and then loaded into a data lake or data warehouse. Understanding how to move data is critical for the next set of skills, which are associated with data warehouses, data lakes, and sometimes data lakehouses, an architecture that is growing in popularity.
- Data warehouses will help you understand data modeling and why experienced data engineers process data in certain ways. Gaining this insight will allow you to ensure greater consistency, helping companies make more informed decisions.
- Data lakes are worth understanding because of the role they play in companies: they let businesses manage data in a way that is often less expensive and less process-heavy than data warehousing.
- Data lakehouses are a concept that has become popular over the past year. Companies find this option appealing because it combines elements of both data warehouses and data lakes.
You can spend a lot of time learning about the three systems above, as there are many best practices in terms of ETLs, data modeling, etc. Don't rush through this layer of learning, as it is the "meat and potatoes" of data engineering.
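To make the data modeling part of this layer more concrete, here is a minimal sketch of a star schema, one common warehouse modeling pattern: a central fact table of measurements surrounded by dimension tables that describe them. It uses SQLite in memory as a stand-in for a real warehouse, and every table name, column name, and row is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a real data warehouse

# Two dimension tables describe "who" and "when"; the fact table stores
# the measurements and points at the dimensions by key.
conn.executescript(
    """
    CREATE TABLE dim_customer (
        customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT
    );
    CREATE TABLE dim_date (
        date_id INTEGER PRIMARY KEY, iso_date TEXT, month TEXT
    );
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );

    INSERT INTO dim_customer VALUES (1, 'Alice', 'US'), (2, 'Bob', 'DE');
    INSERT INTO dim_date VALUES
        (20240101, '2024-01-01', '2024-01'),
        (20240201, '2024-02-01', '2024-02');
    INSERT INTO fact_sales (customer_id, date_id, amount) VALUES
        (1, 20240101, 30.0), (2, 20240101, 12.5), (1, 20240201, 20.0);
    """
)

# Analysts can slice the same facts by any dimension attribute.
sales_by_country_month = conn.execute(
    """
    SELECT c.country, d.month, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY c.country, d.month
    ORDER BY c.country, d.month
    """
).fetchall()
conn.close()

print(sales_by_country_month)
```

Keeping descriptive attributes in dimensions and numbers in facts is one reason warehouse data stays consistent: a customer's country is stored once, so every report that joins to it agrees.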
Ask yourself critical questions, such as:
- What are these three concepts? Where have they evolved from and where are they going?
- What is the difference between ETLs and ELTs?
- What is the goal of this layer from a business perspective?
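One way to explore the ETL versus ELT question is to run both on the same input. The sketch below is an illustration under simplifying assumptions: the CSV text, the table names, and the cleaning rules are all made up, SQLite stands in for the warehouse, and the SQL validity check is deliberately crude.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with one malformed amount and inconsistent casing.
RAW_CSV = """customer,amount
alice,30.0
bob,not_a_number
ALICE,20.0
"""


def extract(text):
    """Extract: parse the CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform in Python: normalize names, drop rows with bad amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        clean.append((row["customer"].strip().lower(), amount))
    return clean


conn = sqlite3.connect(":memory:")

# ETL: transform BEFORE loading, so only clean rows reach the target table.
conn.execute("CREATE TABLE payments_etl (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO payments_etl VALUES (?, ?)", transform(extract(RAW_CSV))
)

# ELT: load the raw rows as-is, then transform with SQL inside the database.
conn.execute("CREATE TABLE payments_raw (customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO payments_raw VALUES (:customer, :amount)", extract(RAW_CSV)
)
conn.execute(
    """
    CREATE TABLE payments_elt AS
    SELECT lower(trim(customer)) AS customer, CAST(amount AS REAL) AS amount
    FROM payments_raw
    WHERE trim(amount) GLOB '[0-9]*'  -- crude check: starts with a digit
    """
)

etl_rows = sorted(conn.execute("SELECT * FROM payments_etl").fetchall())
elt_rows = sorted(conn.execute("SELECT * FROM payments_elt").fetchall())
raw_count = conn.execute("SELECT COUNT(*) FROM payments_raw").fetchone()[0]
conn.close()

print(etl_rows == elt_rows, raw_count)
```

Both routes end with the same clean rows. The difference is where the work happens and what is kept: with ELT the untouched raw table remains in the database, so a transformation can be changed and rerun later without going back to the source.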
Cloud, DevOps, and Data Visualization
Once you gain more experience, the basics behind this step are fairly straightforward. However, when you are first developing data engineering skills, everything can seem overwhelming, simply because there is a lot to learn.
- Start by understanding the cloud in terms of serverless computing, cloud data warehouses, etc. If you work for a startup in the future, this knowledge will be valuable.
- DevOps will help you take code from your own environment into a production environment. Become familiar with Git, a tool used for source code management.
- While learning about data visualization, you will pick a tool such as Tableau. Learn best practices as well.
Streaming Data, Distributed Computing, and Specialization
Once you have learned about the first three layers and the concepts within them, you can become more specific with your approach. Since you'll have a background in ETLs and data warehousing, and will be accustomed to working with the cloud, setting up a streaming pipeline on a service such as Amazon Kinesis will come more naturally to you.
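Before touching a managed streaming service, it can help to see the core idea in plain Python. The sketch below is not the Kinesis API; it is a hypothetical, self-contained illustration of processing events one at a time and aggregating them in fixed (tumbling) time windows, assuming events arrive in timestamp order.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, counts) for each fixed-size time window.

    Assumes `events` is an iterable of (timestamp, key) pairs that
    arrive in timestamp order, which real streams do not guarantee.
    """
    current_start = None
    counts = defaultdict(int)
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_seconds)
        if current_start is not None and window_start != current_start:
            # The previous window is complete, so emit it and start fresh.
            yield current_start, dict(counts)
            counts = defaultdict(int)
        current_start = window_start
        counts[key] += 1
    if current_start is not None:
        yield current_start, dict(counts)  # flush the final window


# Hypothetical event stream: (seconds since start, event type).
stream = [(0, "click"), (3, "view"), (9, "click"),
          (10, "click"), (14, "view"), (21, "click")]

windows = list(tumbling_window_counts(iter(stream), window_seconds=10))
print(windows)
```

Unlike a batch ETL job, the function never holds the whole dataset: it keeps only the current window in memory and emits a result as soon as that window closes.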
At this stage, you can also dive deeper into distributed processing, as well as the pros and cons of using that kind of system.
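The basic pattern behind many distributed systems is to split the data into partitions, process each partition independently, and merge the partial results. Here is a small sketch of that pattern as a word count. It is only an illustration: threads on one machine stand in for the worker nodes of a real cluster, and the function names and sample lines are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_partition(lines):
    """Map step: count words within a single partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-partition counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, n_partitions=3):
    # Split the input round-robin into partitions.
    partitions = [lines[i::n_partitions] for i in range(n_partitions)]
    # Threads stand in for cluster workers in this sketch.
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        partials = list(pool.map(map_partition, partitions))
    return reduce_counts(partials)


lines = ["data moves from A to B", "data engineers move data", "B to A"]
totals = word_count(lines)
print(totals.most_common(3))
```

The pros and cons show up even at this scale. Partitions can be processed in parallel and the approach scales by adding workers, but you pay for splitting, coordinating, and merging, so for small inputs a single loop is simpler and often faster.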
Some data engineers strive to become specialists, working strictly with one vendor or tool, such as Microsoft and Azure Data Factory, and the list goes on. Many companies are looking for experts in specific areas, so this is something many new data engineers take into consideration while honing their skills.
The best part of being more knowledgeable is that you have the freedom to choose what you'd like to focus on. Some enjoy building infrastructure components while others prefer building data products.
As a new data engineer, your goal is to help companies better manage their data. Regardless of how big or successful a company is, there will always be data problems. This is great for budding data engineers because it makes strong job security more likely.
In summary, what skills should data engineers have?
- Be able to build and maintain database systems.
- Understand and be fluent in programming languages, especially Python and SQL.
- Know how to find and use warehousing solutions, as well as ETL tools.
- Have a thorough understanding of cloud technology, data visualization, etc.
- Familiarize yourself with the most essential tools and build software-specific skills based on your area of expertise, such as skills specific to Redshift, Azure, Apache, etc.
Unlike data scientists and data analysts, data engineers are more concerned with preparing data than with analyzing and interpreting it. Although many of the skills across all three titles overlap, data engineers focus on ETLs, data warehousing, advanced programming, scripting, data visualization, and pipelining. In-depth knowledge of SQL is imperative. Once you hone the skills above, you will have the freedom to master the systems, tools, and models that appeal to you most. Whether you're interested in managing a company's big data infrastructure or are drawn to machine learning, you can start building toward that career today with the basic skills discussed above.
Thanks for reading! If you want to read more about data consulting, big data, and data science, then click below.
Realities Of Being A Data Engineer
Developing A Data Analytics Strategy For Small Businesses And Start-ups
5 SQL Concepts You Need To Know Before Your Next Data Science Or Data Engineering Interview
How To Improve Your Data-Driven Strategy
What Is A Data Warehouse And Why Use It
Mistakes That Are Ruining Your Data-Driven Strategy
5 Great Libraries To Manage Big Data With Python