DEV Community

Vinicius Fagundes
Vinicius Fagundes

Posted on

The Math You Actually Need as a Data Engineer

Introduction

Let's address the elephant in the room.

"Do I need to be good at math to become a data engineer?"

I've heard this question hundreds of times from students and career changers. The answer might surprise you:

You need less math than you think — but more than zero.

Data engineering is not data science. You won't be deriving gradient descent formulas or proving statistical theorems. But you will encounter mathematical concepts daily, often without realizing it.

In this article, I'll cover exactly what you need to know — practical mathematics that makes you a better engineer, without the academic overhead.


The Math Data Engineers Actually Use

Let me be direct. Here's what matters:

Concept How Often You'll Use It
Set Theory Daily
Boolean Logic Daily
Basic Statistics Weekly
Aggregations & Arithmetic Daily
Probability Basics Occasionally
Linear Algebra Rarely
Calculus Almost never

If you struggled with advanced mathematics in school, don't let that stop you. Most data engineering math is intuitive once you see how it applies.


Set Theory: The Foundation of SQL

Every SQL query you write is set theory in action.

If you understand sets, you understand SQL. It's that fundamental.

What Is a Set?

A set is a collection of distinct elements.

Set A = {1, 2, 3, 4, 5}
Set B = {4, 5, 6, 7, 8}
Enter fullscreen mode Exit fullscreen mode

Set Operations

These map directly to SQL:

Set Operation SQL Equivalent Result
Union UNION All elements in A or B
Intersection INNER JOIN Elements in both A and B
Difference EXCEPT / LEFT JOIN Elements in A but not B
Cartesian Product CROSS JOIN All combinations of A and B

Practical Example

-- Union: Combine two sets
SELECT customer_id FROM online_orders
UNION
SELECT customer_id FROM store_orders;

-- Intersection: Find common elements
SELECT a.customer_id
FROM online_orders a
INNER JOIN store_orders b ON a.customer_id = b.customer_id;

-- Difference: Find elements only in first set
SELECT customer_id FROM online_orders
EXCEPT
SELECT customer_id FROM store_orders;
Enter fullscreen mode Exit fullscreen mode

Venn Diagrams

Visualizing JOINs as Venn diagrams helps immensely:

INNER JOIN:     Only the overlap
LEFT JOIN:      All of left + overlap
RIGHT JOIN:     All of right + overlap
FULL OUTER:     Everything
Enter fullscreen mode Exit fullscreen mode

If set theory clicks, SQL becomes intuitive.


Boolean Logic: The Language of Filters

Every WHERE clause, every IF statement, every CASE expression — it's all boolean logic.

The Basics

Operator Meaning Example
AND Both must be true status = 'active' AND age > 18
OR At least one must be true country = 'US' OR country = 'CA'
NOT Inverts the condition NOT status = 'deleted'

Truth Tables

Understanding how AND and OR behave:

AND
| A | B | A AND B |
|---|---|---------|
| T | T | T |
| T | F | F |
| F | T | F |
| F | F | F |

OR
| A | B | A OR B |
|---|---|--------|
| T | T | T |
| T | F | T |
| F | T | T |
| F | F | F |

De Morgan's Laws

This is where many engineers trip up. These laws are essential for complex filters:

NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
Enter fullscreen mode Exit fullscreen mode

Practical Example:

-- These are equivalent:
WHERE NOT (status = 'active' AND region = 'EU')
WHERE status != 'active' OR region != 'EU'
Enter fullscreen mode Exit fullscreen mode

Understanding De Morgan's Laws helps you simplify complex conditions and debug filters that aren't working as expected.

NULL: The Third State

In databases, there's a third logical state: NULL (unknown).

This breaks normal boolean logic:

Expression Result
TRUE AND NULL NULL
FALSE AND NULL FALSE
TRUE OR NULL TRUE
FALSE OR NULL NULL
NOT NULL NULL
NULL = NULL NULL (not TRUE!)

This is why WHERE column = NULL never works. Use WHERE column IS NULL instead.


Basic Statistics: Understanding Your Data

You don't need a statistics degree. But you must understand descriptive statistics.

Measures of Central Tendency

Measure What It Tells You SQL
Mean Average value AVG(column)
Median Middle value Varies by database
Mode Most frequent value GROUP BY + ORDER BY COUNT

When to Use Each

  • Mean — General average, but sensitive to outliers
  • Median — Better for skewed distributions (salaries, prices)
  • Mode — Best for categorical data

Example: If salaries are {50k, 55k, 60k, 65k, 500k}

  • Mean: 146k (misleading)
  • Median: 60k (more representative)

Measures of Spread

Measure What It Tells You SQL
Range Max - Min MAX(col) - MIN(col)
Variance How spread out data is VAR(column)
Standard Deviation Spread in original units STDDEV(column)

Percentiles and Quartiles

Essential for understanding distributions:

  • P25 (Q1) — 25% of data falls below this
  • P50 (Median) — 50% of data falls below this
  • P75 (Q3) — 75% of data falls below this
  • P99 — Common for performance metrics
-- PostgreSQL example
SELECT 
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY response_time) AS median,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY response_time) AS p99
FROM api_logs;
Enter fullscreen mode Exit fullscreen mode

Aggregations: The Daily Bread

You'll write aggregations constantly. Know them cold.

Common Aggregation Functions

Function Purpose
COUNT Number of rows
SUM Total of values
AVG Mean value
MIN Smallest value
MAX Largest value
COUNT(DISTINCT) Unique values

GROUP BY Logic

GROUP BY partitions your data into sets, then applies aggregations to each set.

SELECT 
    region,
    COUNT(*) AS order_count,
    SUM(amount) AS total_revenue,
    AVG(amount) AS avg_order_value
FROM orders
GROUP BY region;
Enter fullscreen mode Exit fullscreen mode

Think of it as: "For each region, calculate these metrics."

Window Functions: Aggregations Without Grouping

Window functions let you aggregate while keeping individual rows:

SELECT 
    order_id,
    customer_id,
    amount,
    SUM(amount) OVER (PARTITION BY customer_id) AS customer_total,
    AVG(amount) OVER (PARTITION BY customer_id) AS customer_avg
FROM orders;
Enter fullscreen mode Exit fullscreen mode

Understanding PARTITION BY is set theory again — you're defining subsets.


Probability Basics: When Data Gets Uncertain

You don't need deep probability theory. But some basics help.

Key Concepts

Concept Definition
Probability Likelihood of an event: 0 to 1
Independent Events One event doesn't affect another
Conditional Probability Probability given something else happened

Where This Appears

  • Data quality: "What percentage of records have missing values?"
  • Sampling: "Is this sample representative?"
  • A/B testing: "Is this result statistically significant?"

Basic Calculations

-- Probability of NULL values in a column
SELECT 
    COUNT(*) AS total_rows,
    SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS null_count,
    ROUND(SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)::decimal / COUNT(*), 4) AS null_probability
FROM customers;
Enter fullscreen mode Exit fullscreen mode

Data Types and Precision

Understanding how numbers are stored prevents costly mistakes.

Integer vs. Float

Type Use Case Watch Out For
INTEGER Counts, IDs Overflow at limits
FLOAT Scientific data Precision errors
DECIMAL Money, exact values Slower, more storage

Floating Point Precision

-- This might not equal 0.3
SELECT 0.1 + 0.2;
-- Result: 0.30000000000000004
Enter fullscreen mode Exit fullscreen mode

For financial data, always use DECIMAL.

Division Pitfalls

-- Integer division truncates
SELECT 5 / 2;  -- Returns 2, not 2.5

-- Cast to get decimals
SELECT 5::decimal / 2;  -- Returns 2.5
Enter fullscreen mode Exit fullscreen mode

Practical Math Patterns

Here are patterns I use constantly:

Calculating Percentages

SELECT 
    category,
    COUNT(*) AS count,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS percentage
FROM products
GROUP BY category;
Enter fullscreen mode Exit fullscreen mode

Year-over-Year Growth

SELECT 
    year,
    revenue,
    LAG(revenue) OVER (ORDER BY year) AS prev_year,
    ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY year)) 
        / LAG(revenue) OVER (ORDER BY year), 2) AS yoy_growth
FROM annual_revenue;
Enter fullscreen mode Exit fullscreen mode

Running Totals

SELECT 
    date,
    amount,
    SUM(amount) OVER (ORDER BY date) AS running_total
FROM transactions;
Enter fullscreen mode Exit fullscreen mode

Moving Averages

SELECT 
    date,
    value,
    AVG(value) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS seven_day_avg
FROM daily_metrics;
Enter fullscreen mode Exit fullscreen mode

Math You Can Skip (For Now)

As a data engineer, you can deprioritize:

Topic Why You Can Skip It
Calculus Leave it to data scientists
Linear Algebra Only needed for ML engineering
Advanced Statistics Beyond descriptive stats, not essential
Proofs and Theorems You're building, not proving

If you move toward machine learning engineering later, revisit these. For core data engineering, they're not required.


How to Practice

The best way to learn math for data engineering is through SQL:

  1. Take a public dataset — Kaggle has plenty
  2. Ask questions — "What's the median order value by region?"
  3. Write queries — Apply the concepts above
  4. Verify results — Do they make logical sense?

Math becomes intuitive when tied to real problems.


Key Takeaways

  • Set theory — Understand it, and SQL makes sense
  • Boolean logic — Every filter depends on it
  • Descriptive statistics — Mean, median, percentiles
  • Aggregations — Your daily tools
  • Precision — Know your data types

You don't need to love math. You need to respect it enough to get the fundamentals right.


What's Next?

Theory only takes you so far. In the next article, we put everything together:

Building your first data pipeline — from concept to working code.


Series Overview

  1. Data Engineering Uncovered: What It Is and Why It Matters
  2. Pipelines, ETL, and Warehouses: The DNA of Data Engineering
  3. Tools of the Trade: What Powers Modern Data Engineering
  4. The Math You Actually Need as a Data Engineer (You are here)
  5. Building Your First Pipeline: From Concept to Execution
  6. Charting Your Path: Courses and Resources to Accelerate Your Journey

Found this helpful? Have questions about applying these concepts? Drop them in the comments.

Top comments (0)