Data Engineering Roadmap 2023

Introduction

Data engineering is a crucial field within the broader realm of data science and analytics. It involves the collection, transformation, and storage of data to make it accessible and useful for analysis. As a beginner in data engineering, you may feel daunted and wonder how to get started and build a successful career in this dynamic and in-demand field. This roadmap will guide you through the essential steps and concepts you need to master as you embark on your data engineering journey.

Data engineers use tools such as Java to build APIs, Python to write ETL pipelines, and SQL to access data in source systems and move it to target locations.
This roadmap has been broken down into monthly deliverables.

Month 1: Basics of Programming

The first thing to master as a data engineer is a programming language. The most common choice is Python, which will let you kickstart your data engineering journey.

Python is a versatile programming language: it is easy to use, has a rich ecosystem of supporting libraries, and shows up in every part of the data engineering process.

  • Understand Python basics: operators, variables, and data types
  • Learn to work with data files; this includes Python libraries like pandas, which are widely used for reading and manipulating data (see the sketch after this list)
  • Learn the basics of relational databases
    • SQL Server/MySQL/PostgreSQL
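
To make "working with data files" concrete, here is a minimal pandas sketch; the file name and column names are hypothetical placeholders.

```python
# Minimal pandas sketch: load a CSV, inspect it, and derive a new column.
# "sales.csv" and its columns are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("sales.csv")                 # read a data file into a DataFrame
print(df.head())                              # inspect the first five rows
print(df.dtypes)                              # check the inferred data types

df["total"] = df["quantity"] * df["price"]    # derive a new column
df.to_csv("sales_clean.csv", index=False)     # write the cleaned data back out
```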

Learn the fundamentals of computing

  • Master Git and GitHub for version control
  • Focus on shell scripting in Linux; you'll use it for cron jobs, setting up environments, and similar tasks
  • Web scraping is part and parcel of a data engineer's job: we often need to extract data from websites that do not offer a straightforward, helpful API (a minimal sketch follows this list)
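
As an illustration, here is a minimal scraping sketch using the requests and BeautifulSoup libraries (`pip install requests beautifulsoup4`); the URL and CSS selectors are hypothetical placeholders.

```python
# Minimal scraping sketch with requests and BeautifulSoup.
# The URL and CSS selectors are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com/products")
response.raise_for_status()                      # fail early on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")
for item in soup.select("div.product"):          # hypothetical selector
    name = item.select_one("h2").get_text(strip=True)
    price = item.select_one("span.price").get_text(strip=True)
    print(name, price)
```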

Month 2: Databases

Relational databases are among the most common core storage components, and a good understanding of them is essential for working with large amounts of data.
One needs to master the following:

  • Keys in SQL
  • Joins in SQL
  • Rank window functions (joins and ranking are sketched after this list)
  • Normalization
  • Aggregations
  • Data wrangling and analysis
  • Data modeling for the warehouse
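
To make joins and rank window functions concrete, here is a small sketch run from Python against an in-memory SQLite database (window functions require SQLite 3.25+); the tables and values are toy examples.

```python
# Joins and a rank window function against an in-memory SQLite database.
# Window functions require SQLite 3.25+; the tables are toy examples.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE departments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT,
                            dept_id INTEGER REFERENCES departments(id),
                            salary REAL);
    INSERT INTO departments VALUES (1, 'Data'), (2, 'Finance');
    INSERT INTO employees VALUES
        (1, 'Amina', 1, 90000), (2, 'Brian', 1, 80000), (3, 'Carol', 2, 85000);
""")

# JOIN the two tables and RANK employees by salary within each department
rows = conn.execute("""
    SELECT d.name AS department, e.name AS employee, e.salary,
           RANK() OVER (PARTITION BY e.dept_id
                        ORDER BY e.salary DESC) AS salary_rank
    FROM employees e
    JOIN departments d ON d.id = e.dept_id;
""").fetchall()

for row in rows:
    print(row)   # e.g. ('Data', 'Amina', 90000.0, 1)
```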

Month 3: Cloud Computing

Learn about cloud platforms, which deliver computing services over the internet.
The three main choices are:

  • Amazon Web Services (AWS)
  • Microsoft Azure
  • Google Cloud Platform (GCP)

You can pick any one cloud platform to start with; once you learn it, mastering the others becomes easier. The fundamental concepts are similar, with slight differences in user interface, cost, and other factors.
At this point you understand the basics of programming, SQL, web scraping, and APIs. This is enough to work on your first project: bring in data from a website, transform it using Python, and store it in a relational database (sketched below). You can then move the data to the cloud platform you have chosen to work with.
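
The shape of that first project might look like the following sketch, with SQLite standing in for the relational database; a cloud-hosted MySQL or PostgreSQL instance would be reached through a different connection. The data is a toy example.

```python
# The extract-transform-load shape of the first project. SQLite stands in for
# the relational database; a cloud-hosted MySQL/PostgreSQL instance would be
# reached through a different connection. The data is a toy example.
import sqlite3
import pandas as pd

# Extract: in a real project this DataFrame would come from your scraper or an API
raw = pd.DataFrame({"product": ["notebook", "pen"], "price": ["10.5", "7.0"]})

# Transform: enforce types and derive a new column (16% VAT is a sample rate)
raw["price"] = raw["price"].astype(float)
raw["price_with_vat"] = raw["price"] * 1.16

# Load: write the cleaned data into a relational table
conn = sqlite3.connect("warehouse.db")
raw.to_sql("products", conn, if_exists="replace", index=False)
conn.close()
```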

Month 4: Data Processing

Learn how to process big data. Big data has two aspects: batch data and streaming data. We need specialized tools to handle data at this scale, and one of the most popular is Apache Spark. Focus on the following while learning Apache Spark:

  • Spark architecture
  • RDDs in Spark
  • Working with Spark Dataframes
  • Understand Spark Execution
  • Broadcast and Accumulators
  • Spark SQL

Learn to build ETL pipelines using PySpark together with data-preprocessing libraries such as NumPy and pandas; a minimal DataFrame sketch follows.
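
Here is a minimal PySpark sketch of the ideas above, assuming pyspark is installed (`pip install pyspark`); the data is a toy example.

```python
# Minimal PySpark sketch: build a DataFrame, aggregate it, and inspect the
# execution plan. Assumes `pip install pyspark`; the data is a toy example.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("intro").getOrCreate()

df = spark.createDataFrame(
    [("books", 12.0), ("books", 5.0), ("toys", 20.0)],
    ["category", "amount"],
)

totals = df.groupBy("category").agg(F.sum("amount").alias("total"))
totals.explain()   # see how Spark plans to execute the aggregation
totals.show()

spark.stop()
```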

Month 5: Big Data Engineering

Here we build on what we did during the previous month. Learn big data engineering with Spark, optimization in Spark, and workflow scheduling.
The ETL pipelines you build to get data into databases and data warehouses must be managed separately, so we need a workflow scheduling tool to manage pipelines and handle failures.

Learn the following concepts in Apache Airflow (a minimal DAG sketch follows this list):

  • DAGs
  • Task dependencies
  • Operators
  • Scheduling
  • Branching
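
Here is a minimal DAG sketch showing tasks, a dependency, and a daily schedule; it assumes Airflow 2.4+, and the task bodies are placeholders.

```python
# A minimal Airflow DAG: two tasks with a dependency, scheduled daily.
# Assumes Airflow 2.4+; the task bodies are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull data from the source")

def load():
    print("write data to the warehouse")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2023, 1, 1),
    schedule="@daily",   # cron strings also work here
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task   # dependency: extract must finish before load
```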

Month 6: Data warehousing

Getting data into databases is one thing; the challenge is aggregating and storing data in a central repository. First understand the differences between a database, a data warehouse, and a data lake, and between OLTP and OLAP workloads.
There are several data warehousing tools available:

  • Redshift
  • Databricks
  • Snowflake

Month 7: Handling data streaming

Data streaming is the continuous flow of data as it is generated, enabling real-time processing and analysis for immediate insights.
To ensure that data is ingested reliably while it is being generated, we use Apache Kafka.

  • Learn Kafka architecture
  • Learn about producers and consumers
  • Create topics in Kafka

There are other tools for streaming data, such as AWS Kinesis; again, you are not limited in which tool to use. A producer/consumer sketch using Kafka follows.
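
Here is a producer/consumer sketch using the kafka-python client (`pip install kafka-python`); it assumes a broker on localhost:9092, and the "clickstream" topic is a hypothetical example.

```python
# Producer and consumer sketch with the kafka-python client.
# Assumes `pip install kafka-python` and a broker on localhost:9092;
# the "clickstream" topic is a hypothetical example.
import json
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clickstream", {"user": 42, "page": "/home"})   # publish one event
producer.flush()                                              # wait for delivery

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)   # each record ingested from the topic
    break                  # stop after the first record in this demo
```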

Month 8: Processing streaming data

After learning how to ingest streaming data, learn how to process data in real time. You can do this with Kafka alone, but it is not as flexible for ETL purposes as Spark Streaming.
Focus on the following (a Structured Streaming sketch comes after the list):

  • DStreams
  • Stateless vs. Stateful transformation
  • Checkpointing
  • Structured Streaming
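
Here is a Structured Streaming sketch that reads from a Kafka topic, keeps a stateful running count, and checkpoints its progress; it assumes pyspark with the spark-sql-kafka connector and a broker on localhost:9092.

```python
# Structured Streaming sketch: read a Kafka topic, keep a stateful running
# count per key, checkpoint progress, and print results to the console.
# Assumes pyspark with the spark-sql-kafka connector and a broker on
# localhost:9092; the topic name is a hypothetical example.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Stateful aggregation: Spark maintains the running count across micro-batches
counts = events.groupBy(F.col("key")).count()

query = (
    counts.writeStream
    .outputMode("complete")                                   # emit full result table
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/clickstream")
    .start()
)
query.awaitTermination()
```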

Month 9: Data transformation

Every data engineer has to transform data into a form that the other members of the organization can use. Data transformation tools make it easy for data engineers to do so.
Focus on dbt, as many companies are using it.

  • Learn how to use its compiler and runner components
  • Model your data transformations (a minimal model sketch follows this list)
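
dbt models are most often plain SQL files, but recent dbt versions also support Python models on adapters such as Databricks, Snowflake, or BigQuery. Here is a minimal sketch of a Python model on a Spark-backed adapter, where dbt.ref() returns a PySpark DataFrame; `stg_orders` and its columns are hypothetical upstream objects.

```python
# Sketch of a dbt Python model (e.g. models/orders_daily.py) on a Spark-backed
# adapter such as Databricks, where dbt.ref() returns a PySpark DataFrame.
# `stg_orders` and its columns are hypothetical upstream objects.
from pyspark.sql import functions as F

def model(dbt, session):
    dbt.config(materialized="table")    # tell dbt how to materialize this model
    orders = dbt.ref("stg_orders")      # reference an upstream dbt model
    # Transform: aggregate order amounts to one row per day
    return orders.groupBy("order_date").agg(F.sum("amount").alias("daily_total"))
```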

Month 10: Reporting and Dashboards

This is mostly the end product of the data work: the data has been transformed, insights have been derived from it, and it is ready to be presented to stakeholders. You can use any of several tools to visualize data and create dashboards, including:

  • Power BI
  • Tableau
  • Looker

Month 11: NoSQL

Relational databases require structured data, and querying very large datasets can be slow; NoSQL databases address these limits and can handle both structured and unstructured data.
You can focus on learning one NoSQL database, like MongoDB, since it is widely used in the industry and easy to learn.
Focus on the following (a pymongo sketch comes after the list):

  • CAP theorem
  • CRUD operations
  • Documents and Collections
  • Working with different types of operators
  • Aggregation Pipeline
  • Sharding and Replication in MongoDB
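
Here is a minimal pymongo sketch of CRUD operations and an aggregation pipeline (`pip install pymongo`); it assumes a local MongoDB instance, and the database, collection, and fields are toy examples.

```python
# CRUD operations and an aggregation pipeline with pymongo.
# Assumes `pip install pymongo` and MongoDB on localhost; the database,
# collection, and fields are toy examples.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]        # database "shop", collection "orders"

# Create / Read / Update / Delete on documents
orders.insert_one({"item": "book", "qty": 2, "price": 12.5})
print(orders.find_one({"item": "book"}))
orders.update_one({"item": "book"}, {"$set": {"qty": 3}})
orders.delete_many({"qty": {"$lt": 1}})

# Aggregation pipeline: total revenue per item, highest first
pipeline = [
    {"$group": {"_id": "$item",
                "revenue": {"$sum": {"$multiply": ["$qty", "$price"]}}}},
    {"$sort": {"revenue": -1}},
]
for doc in orders.aggregate(pipeline):
    print(doc)
```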

Month 12: Building projects

Even though you will build projects at each step, by now you have an understanding of the essential tools in data engineering. To showcase your skills, build a capstone project and keep learning.

Conclusion

This breakdown allows you to progressively build your data engineering skills over the year. You can adjust the pace of your learning based on your personal preferences and the time you have available. Consistent practice and hands-on experience will be crucial in mastering the field of data engineering.
