Vinicius Fagundes
Tools of the Trade: What Powers Modern Data Engineering

Introduction

You understand what data engineering is. You know how pipelines, ETL, and warehouses work. Now comes the question every beginner asks:

"What tools should I actually learn?"

The data engineering landscape is overwhelming. New frameworks launch every month. Cloud providers release new services constantly. It's easy to get lost.

In this article, I'll cut through the noise. After years of building data systems and training engineers across organizations, I've identified what actually matters — and what you can safely ignore as a beginner.

Let's build your toolkit.


The Core Stack

Every data engineer needs proficiency in four areas:

  1. Languages — How you write logic
  2. Databases & Warehouses — Where data lives
  3. Orchestration — How you schedule and manage pipelines
  4. Cloud Platforms — Where everything runs

Master these, and you can work anywhere.


Programming Languages

SQL: The Non-Negotiable

SQL is the language of data. Period.

Every data engineer writes SQL daily. You'll use it to:

  • Query databases
  • Transform data in warehouses
  • Debug pipeline issues
  • Validate data quality

If you take only one thing from this article, make it this: get very good at SQL.

Not just SELECT statements. Learn the following (a short example comes right after the list):

  • Window functions
  • CTEs (Common Table Expressions)
  • Query optimization
  • DDL (creating and altering tables)
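
To make the first two concrete, here's a minimal sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or newer). The table and column names are made up for illustration.

```python
import sqlite3

# In-memory database with a toy orders table (names are illustrative)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount REAL);
    INSERT INTO orders VALUES
        (1, '2024-01-05', 120.0),
        (1, '2024-02-10', 80.0),
        (2, '2024-01-20', 200.0);
""")

# A CTE plus a window function: a running total per customer
query = """
WITH ordered AS (
    SELECT
        customer_id,
        order_date,
        amount,
        SUM(amount) OVER (
            PARTITION BY customer_id
            ORDER BY order_date
        ) AS running_total
    FROM orders
)
SELECT * FROM ordered ORDER BY customer_id, order_date;
"""

for row in conn.execute(query):
    print(row)
```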

Python: The Swiss Army Knife

Python is the default scripting language for data engineering.

You'll use it for:

  • Writing pipeline logic
  • API integrations
  • Data transformations
  • Automation scripts

Key libraries to know:

  • pandas — Data manipulation
  • requests — API calls
  • sqlalchemy — Database connections
  • pyspark — Big data processing
  • boto3 — AWS interactions
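
A quick taste of a few of these together: a toy extract-and-load script. The API URL is a placeholder I made up, not a real endpoint, and the SQLite target stands in for a real warehouse.

```python
import pandas as pd
import requests
from sqlalchemy import create_engine

# Placeholder endpoint -- swap in a real API
API_URL = "https://api.example.com/v1/orders"

# Extract: pull JSON records from an API
response = requests.get(API_URL, timeout=30)
response.raise_for_status()
df = pd.DataFrame(response.json())

# Transform: light cleanup in pandas
df["amount"] = df["amount"].astype(float)

# Load: write to a database (SQLite here; Postgres would use a postgresql:// URL)
engine = create_engine("sqlite:///warehouse.db")
df.to_sql("orders", engine, if_exists="replace", index=False)
```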

Other Languages Worth Knowing

  • Scala — Spark-heavy environments
  • Java — Legacy systems, Kafka
  • Bash — Scripting, automation

For beginners: focus on SQL and Python. Add others as needed.


Databases and Warehouses

You'll interact with different storage systems depending on the use case.

Relational Databases (OLTP)

Used for transactional workloads:

  • PostgreSQL — Open source, widely used
  • MySQL — Popular in web applications
  • SQL Server — Common in enterprise environments

Cloud Data Warehouses (OLAP)

Used for analytical workloads:

  • Snowflake — Ease of use, separation of storage and compute
  • Google BigQuery — Serverless, great for GCP users
  • Amazon Redshift — Tight AWS integration
  • Databricks SQL — Unified lakehouse platform
  • Microsoft Synapse — Azure ecosystem integration

Data Lakes

Used for raw and unstructured data storage:

  • Amazon S3
  • Google Cloud Storage
  • Azure Data Lake Storage
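
In practice you'll often land files in one of these from code. Here's a minimal sketch with boto3, assuming your AWS credentials are already configured; the bucket and key names are placeholders.

```python
import boto3

# Assumes credentials come from env vars, ~/.aws/credentials, or an IAM role
s3 = boto3.client("s3")

# Bucket and key are placeholders; the date-partitioned key is a common lake layout
s3.upload_file(
    Filename="daily_export.parquet",
    Bucket="my-data-lake-raw",
    Key="sales/2024/01/daily_export.parquet",
)
```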

Which Should You Learn First?

Start with PostgreSQL for relational concepts, then pick one cloud warehouse. I recommend Snowflake or BigQuery — both have free tiers and are beginner-friendly.
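
To show how little ceremony a cloud warehouse needs, here's a small BigQuery sketch against one of Google's public datasets. It assumes you've created a GCP project and set up application default credentials.

```python
from google.cloud import bigquery

# Assumes a default project and credentials are configured
client = bigquery.Client()

# One of BigQuery's public datasets; nothing to load yourself
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

for row in client.query(query).result():
    print(row.name, row.total)
```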


Orchestration Tools

Orchestration is how you schedule, monitor, and manage pipelines.

Without orchestration, you'd be running scripts manually. That doesn't scale.

Apache Airflow

The industry standard.

  • Open source
  • Python-based
  • Massive community
  • Used by most data teams

Airflow uses DAGs (Directed Acyclic Graphs) to define workflows. If you learn one orchestration tool, make it Airflow.
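
Here's roughly what that looks like: a minimal two-task DAG sketch. The task bodies are stubs, and the DAG id and schedule are arbitrary choices for illustration.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling data from the source...")

def load():
    print("loading data into the warehouse...")

# Two tasks, run once a day, with load depending on extract
with DAG(
    dag_id="daily_sales_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # older Airflow versions call this schedule_interval
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> load_task  # the arrow defines the dependency edge
```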

Alternatives

  • Prefect — Modern, Python-native, easier than Airflow
  • Dagster — Strong data asset focus
  • Mage — Newer, visual interface
  • dbt Cloud — For transformation orchestration
  • Azure Data Factory — Azure-native, low-code
  • AWS Step Functions — AWS-native workflows

My Recommendation

Learn Airflow first. It's everywhere. Once you understand Airflow, picking up alternatives is straightforward.


Transformation Tools

dbt (Data Build Tool)

dbt has changed how data teams work.

It allows you to:

  • Write transformations in SQL
  • Version control your models
  • Test data quality
  • Document your transformations

dbt follows the ELT pattern — transformations happen inside the warehouse.

If you're working with a modern data stack, dbt is almost certainly part of it.


Cloud Platforms

Almost all data engineering today happens in the cloud. You need to be comfortable with at least one major provider.

The Big Three

  • AWS — S3, Redshift, Glue, Lambda, EMR, Kinesis
  • Google Cloud — BigQuery, Cloud Storage, Dataflow, Pub/Sub
  • Azure — Synapse, Data Lake, Data Factory, Event Hubs

Which Cloud Should You Learn?

Check job postings in your target market. In my experience:

  • AWS — Most job listings, largest market share
  • GCP — Strong in startups and data-heavy companies
  • Azure — Dominant in enterprise, especially Microsoft shops

Pick one and go deep. The concepts transfer across platforms.


Big Data Processing

When data exceeds what a single machine can handle, you need distributed processing.

Apache Spark

Spark is the dominant big data framework.

Use cases:

  • Processing billions of rows
  • Complex transformations at scale
  • Machine learning on large datasets

You can write Spark jobs in Python (PySpark), Scala, or SQL.
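
For a feel of the PySpark flavor, here's a minimal sketch. The input path is a placeholder; the aggregation is the kind of thing you'd otherwise write in SQL.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_rollup").getOrCreate()

# Placeholder path; Spark also reads CSV, JSON, and database tables
orders = spark.read.parquet("s3://my-bucket/orders/")

# Group-and-sum, distributed across however many machines the cluster has
daily_totals = (
    orders
    .groupBy("order_date")
    .agg(F.sum("amount").alias("total_amount"))
    .orderBy("order_date")
)

daily_totals.show()
```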

When Do You Need Spark?

Honestly? Not as often as people think.

Many teams reach for Spark too early. Modern warehouses (Snowflake, BigQuery) handle most workloads without needing Spark.

Learn the basics, but don't obsess over it until you're dealing with truly massive datasets.


Streaming Tools

For real-time data processing:

  • Apache Kafka — Message streaming, event backbone
  • Apache Flink — Real-time stream processing
  • Spark Streaming — Micro-batch streaming
  • Amazon Kinesis — AWS-native streaming
  • Google Pub/Sub — GCP-native messaging
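
For a sense of the programming model, here's a small producer sketch using the kafka-python client. The broker address and topic name are placeholders.

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

# Broker address and topic are placeholders for a real cluster
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each event is appended to the topic for downstream consumers
producer.send("orders", {"order_id": 42, "amount": 99.5})
producer.flush()  # make sure buffered messages actually go out
```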

Should Beginners Learn Streaming?

Not immediately. Most entry-level roles focus on batch processing. Streaming is an intermediate to advanced skill.

Understand the concepts, but prioritize batch pipelines first.


DevOps and Infrastructure

Modern data engineers don't just write pipelines. They deploy and maintain them.

Essential Skills

  • Git — Version control (absolutely essential)
  • Docker — Containerization (run anywhere)
  • Terraform — Infrastructure as code
  • CI/CD — Automated testing and deployment

How Deep Should You Go?

You don't need to become a DevOps engineer. But you should be able to:

  • Use Git confidently
  • Write a basic Dockerfile
  • Understand CI/CD pipelines
  • Read infrastructure code

The Modern Data Stack

You'll hear this term often. It refers to a common combination of tools:

```
Ingestion:    Fivetran, Airbyte, Stitch
Storage:      Snowflake, BigQuery, Databricks
Transform:    dbt
Orchestrate:  Airflow, Prefect, dbt Cloud
Visualize:    Looker, Tableau, Metabase
```

This stack emphasizes:

  • Cloud-native tools
  • ELT over ETL
  • SQL-first transformations
  • Managed services over self-hosted

What to Learn First: A Priority List

If I were starting over today, here's my order:

  1. SQL — Master it
  2. Python — Get comfortable
  3. Git — Learn the basics
  4. One cloud warehouse — Snowflake or BigQuery
  5. Airflow — Understand orchestration
  6. dbt — Modern transformation
  7. One cloud platform — AWS, GCP, or Azure
  8. Docker — Containerization basics
  9. Spark — When you need scale

Don't try to learn everything at once. Build depth, then breadth.


Tools I Tell Beginners to Ignore (For Now)

  • Kubernetes — Overkill for most starting out
  • Hadoop — Legacy, rarely used in new projects
  • Every new framework that launches — Wait for adoption
  • No-code tools — Learn the fundamentals first

What's Next?

You now have a map of the data engineering toolkit. In the next article, we'll cover something often overlooked:

The mathematics behind data engineering — what you actually need to know, without the academic fluff.


Series Overview

  1. Data Engineering Uncovered: What It Is and Why It Matters
  2. Pipelines, ETL, and Warehouses: The DNA of Data Engineering
  3. Tools of the Trade: What Powers Modern Data Engineering (You are here)
  4. The Math You Actually Need as a Data Engineer
  5. Building Your First Pipeline: From Concept to Execution
  6. Charting Your Path: Courses and Resources to Accelerate Your Journey

Have questions about which tools to prioritize? Drop them in the comments.

Top comments (1)

Jessica Aki

Well, my journey is just beginning 🫠 I knew there was a lot to learn, but I didn't know it was this much. Thank you for helping me streamline it.

In your opinion though, what's the best way to learn? A bootcamp that teaches all of these? Picking and patching from YouTube? Udemy or Coursera courses? DataCamp?

I'm currently learning on my own, and I started by tackling SQL: not everything, but enough to be somewhat confident writing queries. I began with freeCodeCamp's Postgres course, which is why I'm confident writing some types of queries and with database design, but it didn't cover window functions, common table expressions, query optimization, or some other things I learned on my own, like CASE, stored procedures, and even ROLLUP and CUBE. So I know I have to find ways of filling these gaps.

I also recently got an opportunity to learn through DataCamp for free, and I know they're well known for their data courses, so to some extent I can say it's a good choice, but can I get your opinion on this?