Introduction
Let's address the elephant in the room.
"Do I need to be good at math to become a data engineer?"
I've heard this question hundreds of times from students and career changers. The answer might surprise you:
You need less math than you think — but more than zero.
Data engineering is not data science. You won't be deriving gradient descent formulas or proving statistical theorems. But you will encounter mathematical concepts daily, often without realizing it.
In this article, I'll cover exactly what you need to know — practical mathematics that makes you a better engineer, without the academic overhead.
The Math Data Engineers Actually Use
Let me be direct. Here's what matters:
| Concept | How Often You'll Use It |
|---|---|
| Set Theory | Daily |
| Boolean Logic | Daily |
| Basic Statistics | Weekly |
| Aggregations & Arithmetic | Daily |
| Probability Basics | Occasionally |
| Linear Algebra | Rarely |
| Calculus | Almost never |
If you struggled with advanced mathematics in school, don't let that stop you. Most data engineering math is intuitive once you see how it applies.
Set Theory: The Foundation of SQL
Every SQL query you write is set theory in action.
If you understand sets, you understand SQL. It's that fundamental.
What Is a Set?
A set is a collection of distinct elements.
Set A = {1, 2, 3, 4, 5}
Set B = {4, 5, 6, 7, 8}
Set Operations
These map directly to SQL:
| Set Operation | SQL Equivalent | Result |
|---|---|---|
| Union | UNION | All elements in A or B |
| Intersection | INNER JOIN | Elements in both A and B |
| Difference | EXCEPT / LEFT JOIN + IS NULL | Elements in A but not B |
| Cartesian Product | CROSS JOIN | All combinations of A and B |
Practical Example
-- Union: Combine two sets
SELECT customer_id FROM online_orders
UNION
SELECT customer_id FROM store_orders;
-- Intersection: Find common elements
SELECT a.customer_id
FROM online_orders a
INNER JOIN store_orders b ON a.customer_id = b.customer_id;
-- Difference: Find elements only in first set
SELECT customer_id FROM online_orders
EXCEPT
SELECT customer_id FROM store_orders;
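The fourth operation, the Cartesian product, pairs every row of one set with every row of the other. A minimal sketch, assuming two small hypothetical lookup tables (row counts multiply quickly, so use it deliberately):
-- Cartesian product: every size combined with every color
SELECT s.size, c.color
FROM sizes s
CROSS JOIN colors c;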
Venn Diagrams
Visualizing JOINs as Venn diagrams helps immensely:
- INNER JOIN: Only the overlap
- LEFT JOIN: All of left + overlap
- RIGHT JOIN: All of right + overlap
- FULL OUTER JOIN: Everything
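The LEFT JOIN picture also gives you a second way to express the difference from the table above: keep everything on the left, then filter to rows with no match on the right. Here is a sketch against the same online_orders and store_orders tables (DISTINCT mirrors the deduplication that EXCEPT does for you):
-- Difference via anti-join: customers who ordered online but never in store
SELECT DISTINCT a.customer_id
FROM online_orders a
LEFT JOIN store_orders b ON a.customer_id = b.customer_id
WHERE b.customer_id IS NULL;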
If set theory clicks, SQL becomes intuitive.
Boolean Logic: The Language of Filters
Every WHERE clause, every IF statement, every CASE expression — it's all boolean logic.
The Basics
| Operator | Meaning | Example |
|---|---|---|
| AND | Both must be true | status = 'active' AND age > 18 |
| OR | At least one must be true | country = 'US' OR country = 'CA' |
| NOT | Inverts the condition | NOT status = 'deleted' |
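One detail worth internalizing: AND binds tighter than OR, so mixed conditions usually need parentheses. A quick sketch, assuming a hypothetical users table:
-- Without parentheses, AND is applied first:
-- this returns all US users plus only the active CA users
SELECT * FROM users
WHERE country = 'US' OR country = 'CA' AND status = 'active';
-- With parentheses, the intent is explicit
SELECT * FROM users
WHERE (country = 'US' OR country = 'CA') AND status = 'active';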
Truth Tables
Understanding how AND and OR behave:
AND
| A | B | A AND B |
|---|---|---------|
| T | T | T |
| T | F | F |
| F | T | F |
| F | F | F |
OR
| A | B | A OR B |
|---|---|--------|
| T | T | T |
| T | F | T |
| F | T | T |
| F | F | F |
De Morgan's Laws
This is where many engineers trip up. These laws are essential for complex filters:
NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
Practical Example:
-- These are equivalent:
WHERE NOT (status = 'active' AND region = 'EU')
WHERE status != 'active' OR region != 'EU'
Understanding De Morgan's Laws helps you simplify complex conditions and debug filters that aren't working as expected.
NULL: The Third State
In databases, there's a third logical state: NULL (unknown).
This breaks normal boolean logic:
| Expression | Result |
|---|---|
| TRUE AND NULL | NULL |
| FALSE AND NULL | FALSE |
| TRUE OR NULL | TRUE |
| FALSE OR NULL | NULL |
| NOT NULL | NULL |
| NULL = NULL | NULL (not TRUE!) |
This is why WHERE column = NULL never returns rows. Use WHERE column IS NULL instead.
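A quick sketch of the difference, using the customers table that appears later in the article:
-- Comparison with NULL evaluates to NULL, so no rows come back
SELECT * FROM customers WHERE email = NULL;
-- IS NULL is the correct test for missing values
SELECT * FROM customers WHERE email IS NULL;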
Basic Statistics: Understanding Your Data
You don't need a statistics degree. But you must understand descriptive statistics.
Measures of Central Tendency
| Measure | What It Tells You | SQL |
|---|---|---|
| Mean | Average value | AVG(column) |
| Median | Middle value | Varies by database |
| Mode | Most frequent value | GROUP BY + ORDER BY COUNT |
When to Use Each
- Mean — General average, but sensitive to outliers
- Median — Better for skewed distributions (salaries, prices)
- Mode — Best for categorical data
Example: If salaries are {50k, 55k, 60k, 65k, 500k}
- Mean: 146k (misleading)
- Median: 60k (more representative)
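To see the same effect in SQL, compute both side by side. A sketch, assuming a hypothetical employees table with a salary column (PostgreSQL syntax):
-- Mean vs. median on the same column
SELECT
    AVG(salary) AS mean_salary,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees;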
Measures of Spread
| Measure | What It Tells You | SQL |
|---|---|---|
| Range | Max - Min | MAX(col) - MIN(col) |
| Variance | How spread out data is | VAR(column) |
| Standard Deviation | Spread in original units | STDDEV(column) |
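These measures pair naturally with data-quality checks. One common pattern flags values far from the mean in units of standard deviation; here is a sketch against the daily_metrics table used later in the article:
-- Flag values more than 3 standard deviations from the mean
WITH stats AS (
    SELECT AVG(value) AS mean_value, STDDEV(value) AS sd_value
    FROM daily_metrics
)
SELECT d.date, d.value
FROM daily_metrics d
CROSS JOIN stats s
WHERE ABS(d.value - s.mean_value) > 3 * s.sd_value;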
Percentiles and Quartiles
Essential for understanding distributions:
- P25 (Q1) — 25% of data falls below this
- P50 (Median) — 50% of data falls below this
- P75 (Q3) — 75% of data falls below this
- P99 — Common for performance metrics
-- PostgreSQL example
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY response_time) AS median,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time) AS p99
FROM api_logs;
Aggregations: The Daily Bread
You'll write aggregations constantly. Know them cold.
Common Aggregation Functions
| Function | Purpose |
|---|---|
| COUNT | Number of rows |
| SUM | Total of values |
| AVG | Mean value |
| MIN | Smallest value |
| MAX | Largest value |
| COUNT(DISTINCT) | Unique values |
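The COUNT family hides a nuance worth knowing: COUNT(*) counts all rows, COUNT(column) skips NULLs, and COUNT(DISTINCT column) counts unique non-NULL values. A sketch against the orders table used below:
SELECT
    COUNT(*) AS total_rows,                    -- every row
    COUNT(customer_id) AS rows_with_customer,  -- NULL customer_ids excluded
    COUNT(DISTINCT customer_id) AS unique_customers
FROM orders;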
GROUP BY Logic
GROUP BY partitions your data into sets, then applies aggregations to each set.
SELECT
region,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value
FROM orders
GROUP BY region;
Think of it as: "For each region, calculate these metrics."
Window Functions: Aggregations Without Grouping
Window functions let you aggregate while keeping individual rows:
SELECT
order_id,
customer_id,
amount,
SUM(amount) OVER (PARTITION BY customer_id) AS customer_total,
AVG(amount) OVER (PARTITION BY customer_id) AS customer_avg
FROM orders;
Understanding PARTITION BY is set theory again — you're defining subsets.
Probability Basics: When Data Gets Uncertain
You don't need deep probability theory. But some basics help.
Key Concepts
| Concept | Definition |
|---|---|
| Probability | Likelihood of an event: 0 to 1 |
| Independent Events | One event doesn't affect another |
| Conditional Probability | Probability given something else happened |
Where This Appears
- Data quality: "What percentage of records have missing values?"
- Sampling: "Is this sample representative?"
- A/B testing: "Is this result statistically significant?"
Basic Calculations
-- Probability of NULL values in a column
SELECT
COUNT(*) AS total_rows,
SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_count,
ROUND(SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)::decimal / COUNT(*), 4) AS null_probability
FROM customers;
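Conditional probability follows the same pattern: restrict the rows to the condition, then compute the proportion within them. A sketch, assuming the customers table also has a region column:
-- P(missing email | region = 'EU')
SELECT
    ROUND(SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)::decimal / COUNT(*), 4) AS p_null_given_eu
FROM customers
WHERE region = 'EU';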
Data Types and Precision
Understanding how numbers are stored prevents costly mistakes.
Integer vs. Float
| Type | Use Case | Watch Out For |
|---|---|---|
| INTEGER | Counts, IDs | Overflow at limits |
| FLOAT | Scientific data | Precision errors |
| DECIMAL | Money, exact values | Slower, more storage |
Floating Point Precision
-- This might not equal 0.3
SELECT 0.1 + 0.2;
-- Result: 0.30000000000000004
For financial data, always use DECIMAL.
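The error also compounds in aggregations. A quick PostgreSQL sketch that sums 0.1 one hundred times in both types (generate_series simply produces 100 rows):
SELECT
    SUM(0.1::float) AS float_sum,      -- drifts slightly away from 10
    SUM(0.1::decimal) AS decimal_sum   -- exactly 10.0
FROM generate_series(1, 100);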
Division Pitfalls
-- Integer division truncates in PostgreSQL and several other databases
SELECT 5 / 2; -- Returns 2, not 2.5
-- Cast to get decimals
SELECT 5::decimal / 2; -- Returns 2.5
Practical Math Patterns
Here are patterns I use constantly:
Calculating Percentages
SELECT
category,
COUNT(*) AS count,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS percentage
FROM products
GROUP BY category;
Year-over-Year Growth
SELECT
year,
revenue,
LAG(revenue) OVER (ORDER BY year) AS prev_year,
ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY year))
/ LAG(revenue) OVER (ORDER BY year), 2) AS yoy_growth
FROM annual_revenue;
Running Totals
SELECT
date,
amount,
SUM(amount) OVER (ORDER BY date) AS running_total
FROM transactions;
Moving Averages
SELECT
date,
value,
AVG(value) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg
FROM daily_metrics;
Math You Can Skip (For Now)
As a data engineer, you can deprioritize:
| Topic | Why You Can Skip It |
|---|---|
| Calculus | Leave it to data scientists |
| Linear Algebra | Only needed for ML engineering |
| Advanced Statistics | Beyond descriptive stats, not essential |
| Proofs and Theorems | You're building, not proving |
If you move toward machine learning engineering later, revisit these. For core data engineering, they're not required.
How to Practice
The best way to learn math for data engineering is through SQL:
- Take a public dataset — Kaggle has plenty
- Ask questions — "What's the median order value by region?" (see the sketch after this list)
- Write queries — Apply the concepts above
- Verify results — Do they make logical sense?
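A hedged sketch of that median question, assuming an orders table with region and amount columns (PostgreSQL syntax):
SELECT
    region,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_order_value
FROM orders
GROUP BY region;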
Math becomes intuitive when tied to real problems.
Key Takeaways
- Set theory — Understand it, and SQL makes sense
- Boolean logic — Every filter depends on it
- Descriptive statistics — Mean, median, percentiles
- Aggregations — Your daily tools
- Precision — Know your data types
You don't need to love math. You need to respect it enough to get the fundamentals right.
What's Next?
Theory only takes you so far. In the next article, we put everything together:
Building your first data pipeline — from concept to working code.
Series Overview
- Data Engineering Uncovered: What It Is and Why It Matters
- Pipelines, ETL, and Warehouses: The DNA of Data Engineering
- Tools of the Trade: What Powers Modern Data Engineering
- The Math You Actually Need as a Data Engineer (You are here)
- Building Your First Pipeline: From Concept to Execution
- Charting Your Path: Courses and Resources to Accelerate Your Journey
Found this helpful? Have questions about applying these concepts? Drop them in the comments.