Data engineers play a unique role: they lay the essential groundwork that lets data scientists and analysts use high-quality data for insights that support better decision-making. However, becoming a data engineer involves mastering many skills and technologies, which can be daunting.
In this post, we have curated the five best resources for data engineers. We’ve included one resource for each important data engineering skill: Databases, Data Modelling, Data Processing, Distributed Data Systems, and Workflow Management.
Working with databases is the bread and butter of every data engineer, since databases are the basic building blocks of any data-driven application. Designing database structures requires a firm understanding of concepts such as joins, keys, and relationships.
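To make keys, relationships, and joins a little more concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables, their columns, and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

def build_database():
    """Create a tiny in-memory database with two related tables."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY,   -- primary key: identifies a row
            name        TEXT NOT NULL
        );
        CREATE TABLE orders (
            order_id    INTEGER PRIMARY KEY,
            -- foreign key: the one-to-many relationship customers -> orders
            customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
            amount      REAL NOT NULL
        );
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5);
    """)
    return conn

def order_totals(conn):
    """Join the two tables on the key and aggregate per customer."""
    return conn.execute("""
        SELECT c.name, COUNT(o.order_id), SUM(o.amount)
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.customer_id, c.name
        ORDER BY c.name
    """).fetchall()

def orphan_order_is_rejected(conn):
    """Try to insert an order for a customer that does not exist."""
    try:
        conn.execute("INSERT INTO orders VALUES (13, 99, 5.0)")
    except sqlite3.IntegrityError:
        return True
    return False

conn = build_database()
totals = order_totals(conn)
rejected = orphan_order_is_rejected(conn)
print(totals)
print(rejected)
```

The foreign key is what lets the database itself refuse an order that points at a missing customer, instead of leaving that check to application code.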
This free YouTube course is a great learning resource for everything related to database design. For NoSQL databases, this Coursera specialization is a perfect learning resource to know all about different NoSQL databases, the differences between them, and how to model data in them.
We also recommend Database Design for Mere Mortals, a fantastic book on planning, designing, and structuring modern databases.
Modern data warehouses leverage OLAP databases for data processing and analytics. A thorough understanding of OLAP modeling concepts is vital for managing your data warehouse efficiently.
The Data Warehouse Toolkit is a one-stop resource for all your learning needs regarding data modeling. It has comprehensive coverage of the most up-to-date data modeling techniques, including the enhanced star schema dimensional modeling and a lot more. Do check it out!
It would be unfair not to mention another great book - Agile Data Warehouse Design - which covers everything you need to know about fine-tuning high-performance data models in modern data warehouses.
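As a rough illustration of the star schema idea mentioned above, the sketch below builds one fact table surrounded by two dimension tables and runs a typical analytical query against them. It uses Python's built-in `sqlite3` module purely for convenience (SQLite is not an OLAP database), and the table names, columns, and sample rows are all assumptions made for this example.

```python
import sqlite3

def build_star_schema():
    """One fact table (sales) referencing two dimensions (date, product)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_date (
            date_key INTEGER PRIMARY KEY,
            year     INTEGER,
            month    INTEGER
        );
        CREATE TABLE dim_product (
            product_key INTEGER PRIMARY KEY,
            name        TEXT,
            category    TEXT
        );
        -- The fact table holds measures plus a key into each dimension.
        CREATE TABLE fact_sales (
            date_key    INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            quantity    INTEGER,
            revenue     REAL
        );
        INSERT INTO dim_date VALUES (20240115, 2024, 1), (20240220, 2024, 2);
        INSERT INTO dim_product VALUES
            (1, 'Laptop', 'Hardware'),
            (2, 'Mouse', 'Hardware'),
            (3, 'IDE licence', 'Software');
        INSERT INTO fact_sales VALUES
            (20240115, 1, 2, 2000.0),
            (20240115, 2, 5, 100.0),
            (20240220, 3, 10, 500.0),
            (20240220, 1, 1, 1000.0);
    """)
    return conn

def revenue_by_category_and_month(conn):
    """Slice the facts by attributes that live in the dimension tables."""
    return conn.execute("""
        SELECT p.category, d.year, d.month, SUM(f.revenue)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        JOIN dim_date AS d ON d.date_key = f.date_key
        GROUP BY p.category, d.year, d.month
        ORDER BY p.category, d.year, d.month
    """).fetchall()

report = revenue_by_category_and_month(build_star_schema())
for row in report:
    print(row)
```

The descriptive attributes (category, month) sit in the dimensions while the numeric measures sit in the fact table, which is why analytical queries tend to look like "join the fact to its dimensions, then group".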
Data processing is a must-have skill in every data engineer’s arsenal. It involves understanding both batch processing and stream processing. For stream processing, hands-on knowledge of tools like Apache Spark, Flink, and Kafka is very important. For this, we recommend Spark: The Definitive Guide as a practical guide to building modern streaming applications.
Batch processing with Hadoop MapReduce is often considered a thing of the past now, but it is still highly recommended to have a fundamental understanding of how it works. Hadoop: The Definitive Guide is a fantastic resource for getting a hands-on understanding of what batch processing is and how the Hadoop ecosystem works.
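To give a feel for the MapReduce model before picking up the book, here is a toy, single-process word count in plain Python. It only mimics the three conceptual phases (map, shuffle, reduce); it does not use Hadoop or any of its APIs, and the function names are made up for this sketch.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases over a whole batch of input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["big data big pipelines", "data pipelines"])
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data across the network, but the shape of the computation is the same.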
Every data engineer’s core responsibility is to build a robust, secure, and enterprise-grade data architecture that brings together different tools and technologies. For this, having a thorough understanding of distributed systems is crucial.
Therefore, we highly recommend the book Designing Data-Intensive Applications. It perfectly highlights the concept of distributed systems upon which modern data-driven applications are built. This book also demonstrates how to identify the strengths and weaknesses of different data tools and balance the tradeoffs around their complexity and scalability.
A sound understanding of workflow management and scheduling tools like Apache Airflow will help you build data pipelines more efficiently and automate dependency management and tracking.
We recommend this tutorial series on Apache Airflow, which gives a wonderful introduction to the concept of DAGs (Directed Acyclic Graphs) and shows you how to build modern data pipelines using Airflow.
For a fundamental understanding of what DAGs are, we recommend checking out this highly informative video.
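If you would like to see the DAG idea in code first, the sketch below models a small pipeline as a dependency graph and derives a valid run order with Python's standard-library `graphlib` module (Python 3.9+). This is not Airflow's API, and the task names are hypothetical; it only illustrates why "directed" and "acyclic" matter for scheduling.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_datasets": {"extract_orders", "extract_customers"},
    "load_warehouse": {"join_datasets"},
}

def execution_order(dag):
    """Return one ordering in which every task runs after its dependencies."""
    return list(TopologicalSorter(dag).static_order())

def has_cycle(dag):
    """A graph with a cycle has no valid run order, so it is not a DAG."""
    try:
        execution_order(dag)
    except CycleError:
        return True
    return False

order = execution_order(pipeline)
print(order)
print(has_cycle({"a": {"b"}, "b": {"a"}}))
```

The two extract tasks do not depend on each other, so a scheduler is free to run them in either order or in parallel; only the edges of the graph constrain execution.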
Data engineering is a unique crossover of data science and traditional software engineering. Needless to say, as a data engineer, you will write a lot of code to build data pipelines, ETL systems, etc. This generally involves working with different programming languages.
Python has emerged as the de facto programming language for data engineering, and mastering it is becoming increasingly essential. While there are many awesome Python programming resources, we highly recommend the official Python documentation for learning both basic and advanced Python concepts from scratch.
SQL is an essential skill for anyone who wants to work with data, and mastering it is a must for every data engineer. This useful SQL tutorial is a brilliant quick reference to both essential and more advanced SQL concepts.
Finally, almost all modern organizations increasingly need sound data security and data governance policies. For data engineers, this means having a fundamental understanding of how to manage sensitive data and comply with regulations like GDPR and CCPA. For this, we highly recommend this free resource on data security.
This article gave you a sneak peek into some of the best learning resources for honing your data engineering skills. There are, however, two very important things to note:
Firstly, and most importantly, data engineering is mostly a hands-on, on-the-job role that relies heavily on practical experience. Your best bet to become a good data engineer is to build your data systems from scratch and learn from the mistakes you make along the way. The learning resources we mentioned in this article will certainly help you, but they will only take you so far.
Secondly, it’s very hard to find a data engineer whose day-to-day job uses the whole spectrum of data engineering skills. Companies usually look for data engineers whose skill set aligns with their requirements and business use cases. However, a basic to intermediate knowledge of all the essential prerequisites will set you up for a successful data engineering career.