Introduction
Let's address the elephant in the room.
"Do I need to be good at math to become a data engineer?"
I've heard this question hundreds of times from students and career changers. The answer might surprise you:
You need less math than you think — but more than zero.
Data engineering is not data science. You won't be deriving gradient descent formulas or proving statistical theorems. But you will encounter mathematical concepts daily, often without realizing it.
In this article, I'll cover exactly what you need to know — practical mathematics that makes you a better engineer, without the academic overhead.
The Math Data Engineers Actually Use
Let me be direct. Here's what matters:
| Concept | How Often You'll Use It |
|---|---|
| Set Theory | Daily |
| Boolean Logic | Daily |
| Basic Statistics | Weekly |
| Aggregations & Arithmetic | Daily |
| Probability Basics | Occasionally |
| Linear Algebra | Rarely |
| Calculus | Almost never |
If you struggled with advanced mathematics in school, don't let that stop you. Most data engineering math is intuitive once you see how it applies.
Set Theory: The Foundation of SQL
Every SQL query you write is set theory in action.
If you understand sets, you understand SQL. It's that fundamental.
What Is a Set?
A set is a collection of distinct elements.
Set A = {1, 2, 3, 4, 5}
Set B = {4, 5, 6, 7, 8}
Set Operations
These map directly to SQL:
| Set Operation | SQL Equivalent | Result |
|---|---|---|
| Union | UNION | All elements in A or B |
| Intersection | INNER JOIN | Elements in both A and B |
| Difference | EXCEPT / LEFT JOIN + IS NULL | Elements in A but not B |
| Cartesian Product | CROSS JOIN | All combinations of A and B |
Practical Example
-- Union: Combine two sets
SELECT customer_id FROM online_orders
UNION
SELECT customer_id FROM store_orders;
-- Intersection: Find common elements
SELECT a.customer_id
FROM online_orders a
INNER JOIN store_orders b ON a.customer_id = b.customer_id;
-- Difference: Find elements only in first set
SELECT customer_id FROM online_orders
EXCEPT
SELECT customer_id FROM store_orders;
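The fourth operation, the Cartesian product, pairs every row of one set with every row of the other. A minimal sketch, assuming two small hypothetical lookup tables (row counts multiply quickly, so use it deliberately):
-- Cartesian product: every size combined with every color
SELECT s.size, c.color
FROM sizes s
CROSS JOIN colors c;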
Venn Diagrams
Visualizing JOINs as Venn diagrams helps immensely:
- INNER JOIN: Only the overlap
- LEFT JOIN: All of left + overlap
- RIGHT JOIN: All of right + overlap
- FULL OUTER JOIN: Everything
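The LEFT JOIN picture also gives you a second way to express the difference from the table above: keep everything on the left, then filter to rows with no match on the right. Here is a sketch against the same online_orders and store_orders tables (DISTINCT mirrors the deduplication that EXCEPT does for you):
-- Difference via anti-join: customers who ordered online but never in store
SELECT DISTINCT a.customer_id
FROM online_orders a
LEFT JOIN store_orders b ON a.customer_id = b.customer_id
WHERE b.customer_id IS NULL;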
If set theory clicks, SQL becomes intuitive.
Boolean Logic: The Language of Filters
Every WHERE clause, every IF statement, every CASE expression — it's all boolean logic.
The Basics
| Operator | Meaning | Example |
|---|---|---|
| AND | Both must be true | status = 'active' AND age > 18 |
| OR | At least one must be true | country = 'US' OR country = 'CA' |
| NOT | Inverts the condition | NOT status = 'deleted' |
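One detail worth internalizing: AND binds tighter than OR, so mixed conditions usually need parentheses. A quick sketch, assuming a hypothetical users table:
-- Without parentheses, AND is applied first:
-- this returns all US users plus only the active CA users
SELECT * FROM users
WHERE country = 'US' OR country = 'CA' AND status = 'active';
-- With parentheses, the intent is explicit
SELECT * FROM users
WHERE (country = 'US' OR country = 'CA') AND status = 'active';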
Truth Tables
Understanding how AND and OR behave:
AND
| A | B | A AND B |
|---|---|---------|
| T | T | T |
| T | F | F |
| F | T | F |
| F | F | F |
OR
| A | B | A OR B |
|---|---|--------|
| T | T | T |
| T | F | T |
| F | T | T |
| F | F | F |
De Morgan's Laws
This is where many engineers trip up. These laws are essential for complex filters:
NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
Practical Example:
-- These are equivalent:
WHERE NOT (status = 'active' AND region = 'EU')
WHERE status != 'active' OR region != 'EU'
Understanding De Morgan's Laws helps you simplify complex conditions and debug filters that aren't working as expected.
NULL: The Third State
In databases, there's a third logical state: NULL (unknown).
This breaks normal boolean logic:
| Expression | Result |
|---|---|
| TRUE AND NULL | NULL |
| FALSE AND NULL | FALSE |
| TRUE OR NULL | TRUE |
| FALSE OR NULL | NULL |
| NOT NULL | NULL |
| NULL = NULL | NULL (not TRUE!) |
This is why WHERE column = NULL never returns rows. Use WHERE column IS NULL instead.
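A quick sketch of the difference, using the customers table that appears later in the article:
-- Comparison with NULL evaluates to NULL, so no rows come back
SELECT * FROM customers WHERE email = NULL;
-- IS NULL is the correct test for missing values
SELECT * FROM customers WHERE email IS NULL;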
Basic Statistics: Understanding Your Data
You don't need a statistics degree. But you must understand descriptive statistics.
Measures of Central Tendency
| Measure | What It Tells You | SQL |
|---|---|---|
| Mean | Average value | AVG(column) |
| Median | Middle value | Varies by database |
| Mode | Most frequent value | GROUP BY + ORDER BY COUNT |
When to Use Each
- Mean — General average, but sensitive to outliers
- Median — Better for skewed distributions (salaries, prices)
- Mode — Best for categorical data
Example: If salaries are {50k, 55k, 60k, 65k, 500k}
- Mean: 146k (misleading)
- Median: 60k (more representative)
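To see the same effect in SQL, compute both side by side. A sketch, assuming a hypothetical employees table with a salary column (PostgreSQL syntax):
-- Mean vs. median on the same column
SELECT
    AVG(salary) AS mean_salary,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees;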
Measures of Spread
| Measure | What It Tells You | SQL |
|---|---|---|
| Range | Max - Min | MAX(col) - MIN(col) |
| Variance | How spread out data is | VAR(column) |
| Standard Deviation | Spread in original units | STDDEV(column) |
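These measures pair naturally with data-quality checks. One common pattern flags values far from the mean in units of standard deviation; here is a sketch against the daily_metrics table used later in the article:
-- Flag values more than 3 standard deviations from the mean
WITH stats AS (
    SELECT AVG(value) AS mean_value, STDDEV(value) AS sd_value
    FROM daily_metrics
)
SELECT d.date, d.value
FROM daily_metrics d
CROSS JOIN stats s
WHERE ABS(d.value - s.mean_value) > 3 * s.sd_value;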
Percentiles and Quartiles
Essential for understanding distributions:
- P25 (Q1) — 25% of data falls below this
- P50 (Median) — 50% of data falls below this
- P75 (Q3) — 75% of data falls below this
- P99 — Common for performance metrics
-- PostgreSQL example
SELECT
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY response_time) AS median,
PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time) AS p99
FROM api_logs;
Aggregations: The Daily Bread
You'll write aggregations constantly. Know them cold.
Common Aggregation Functions
| Function | Purpose |
|---|---|
| COUNT | Number of rows |
| SUM | Total of values |
| AVG | Mean value |
| MIN | Smallest value |
| MAX | Largest value |
| COUNT(DISTINCT) | Unique values |
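The COUNT family hides a nuance worth knowing: COUNT(*) counts all rows, COUNT(column) skips NULLs, and COUNT(DISTINCT column) counts unique non-NULL values. A sketch against the orders table used below:
SELECT
    COUNT(*) AS total_rows,                    -- every row
    COUNT(customer_id) AS rows_with_customer,  -- NULL customer_ids excluded
    COUNT(DISTINCT customer_id) AS unique_customers
FROM orders;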
GROUP BY Logic
GROUP BY partitions your data into sets, then applies aggregations to each set.
SELECT
region,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value
FROM orders
GROUP BY region;
Think of it as: "For each region, calculate these metrics."
Window Functions: Aggregations Without Grouping
Window functions let you aggregate while keeping individual rows:
SELECT
order_id,
customer_id,
amount,
SUM(amount) OVER (PARTITION BY customer_id) AS customer_total,
AVG(amount) OVER (PARTITION BY customer_id) AS customer_avg
FROM orders;
Understanding PARTITION BY is set theory again — you're defining subsets.
Probability Basics: When Data Gets Uncertain
You don't need deep probability theory. But some basics help.
Key Concepts
| Concept | Definition |
|---|---|
| Probability | Likelihood of an event: 0 to 1 |
| Independent Events | One event doesn't affect another |
| Conditional Probability | Probability given something else happened |
Where This Appears
- Data quality: "What percentage of records have missing values?"
- Sampling: "Is this sample representative?"
- A/B testing: "Is this result statistically significant?"
Basic Calculations
-- Probability of NULL values in a column
SELECT
COUNT(*) AS total_rows,
SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_count,
ROUND(SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)::decimal / COUNT(*), 4) AS null_probability
FROM customers;
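Conditional probability follows the same pattern: restrict the rows to the condition, then compute the proportion within them. A sketch, assuming the customers table also has a region column:
-- P(missing email | region = 'EU')
SELECT
    ROUND(SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)::decimal / COUNT(*), 4) AS p_null_given_eu
FROM customers
WHERE region = 'EU';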
Data Types and Precision
Understanding how numbers are stored prevents costly mistakes.
Integer vs. Float
| Type | Use Case | Watch Out For |
|---|---|---|
| INTEGER | Counts, IDs | Overflow at limits |
| FLOAT | Scientific data | Precision errors |
| DECIMAL | Money, exact values | Slower, more storage |
Floating Point Precision
-- This might not equal 0.3
SELECT 0.1 + 0.2;
-- Result: 0.30000000000000004
For financial data, always use DECIMAL.
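The error also compounds in aggregations. A quick PostgreSQL sketch that sums 0.1 one hundred times in both types (generate_series simply produces 100 rows):
SELECT
    SUM(0.1::float) AS float_sum,      -- drifts slightly away from 10
    SUM(0.1::decimal) AS decimal_sum   -- exactly 10.0
FROM generate_series(1, 100);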
Division Pitfalls
-- Integer division truncates in PostgreSQL and several other databases
SELECT 5 / 2; -- Returns 2, not 2.5
-- Cast to get decimals
SELECT 5::decimal / 2; -- Returns 2.5
Practical Math Patterns
Here are patterns I use constantly:
Calculating Percentages
SELECT
category,
COUNT(*) AS count,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS percentage
FROM products
GROUP BY category;
Year-over-Year Growth
SELECT
year,
revenue,
LAG(revenue) OVER (ORDER BY year) AS prev_year,
ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY year))
/ LAG(revenue) OVER (ORDER BY year), 2) AS yoy_growth
FROM annual_revenue;
Running Totals
SELECT
date,
amount,
SUM(amount) OVER (ORDER BY date) AS running_total
FROM transactions;
Moving Averages
SELECT
date,
value,
AVG(value) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg
FROM daily_metrics;
Math You Can Skip (For Now)
As a data engineer, you can deprioritize:
| Topic | Why You Can Skip It |
|---|---|
| Calculus | Leave it to data scientists |
| Linear Algebra | Only needed for ML engineering |
| Advanced Statistics | Beyond descriptive stats, not essential |
| Proofs and Theorems | You're building, not proving |
If you move toward machine learning engineering later, revisit these. For core data engineering, they're not required.
How to Practice
The best way to learn math for data engineering is through SQL:
- Take a public dataset — Kaggle has plenty
- Ask questions — "What's the median order value by region?" (see the sketch after this list)
- Write queries — Apply the concepts above
- Verify results — Do they make logical sense?
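A hedged sketch of that median question, assuming an orders table with region and amount columns (PostgreSQL syntax):
SELECT
    region,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median_order_value
FROM orders
GROUP BY region;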
Math becomes intuitive when tied to real problems.
Key Takeaways
- Set theory — Understand it, and SQL makes sense
- Boolean logic — Every filter depends on it
- Descriptive statistics — Mean, median, percentiles
- Aggregations — Your daily tools
- Precision — Know your data types
You don't need to love math. You need to respect it enough to get the fundamentals right.
What's Next?
Theory only takes you so far. In the next article, we put everything together:
Building your first data pipeline — from concept to working code.
Series Overview
- Data Engineering Uncovered: What It Is and Why It Matters
- Pipelines, ETL, and Warehouses: The DNA of Data Engineering
- Tools of the Trade: What Powers Modern Data Engineering
- The Math You Actually Need as a Data Engineer (You are here)
- Building Your First Pipeline: From Concept to Execution
- Charting Your Path: Courses and Resources to Accelerate Your Journey
Found this helpful? Have questions about applying these concepts? Drop them in the comments.