Your churn model isn’t degrading because the algorithm is weak.
It might be degrading because of a JOIN.
I’ve seen teams spend weeks tuning hyperparameters, switching architectures, and debating feature importance — only to discover the real issue was upstream data logic.
Before you tune the model, check your SQL.
The Problem Most Teams Misdiagnose
When performance drops, we usually suspect:
Model drift
Hyperparameter tuning
Feature scaling
Algorithm choice
Those are valid concerns.
But machine learning models don’t invent patterns.
They learn from the data we feed them.
If the dataset is flawed, the model will faithfully learn those flaws.
Upstream data logic determines downstream model behavior.
Scenario: The “Failing” Churn Model
A churn prediction model starts underperforming.
Same architecture.
Same training pipeline.
Same evaluation framework.
Nothing obvious changed.
After investigation, the issue wasn’t model complexity.
It was this:
SELECT *
FROM customers c
JOIN orders o
ON c.customer_id = o.customer_id;
It looks harmless.
It runs fast.
It returns data.
It passes basic tests.
But customers with multiple orders are duplicated across rows.
High-activity users become unintentionally overweighted in the training dataset.
The model didn’t fail.
It did exactly what we told it to do.
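You can reproduce the effect in a few lines. This is a minimal sketch using an in-memory SQLite database with made-up sample data; the table and column names mirror the query above, but the rows are hypothetical:

```python
import sqlite3

# Tiny made-up dataset: customer 1 placed three orders, customer 2 placed one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, churned INTEGER);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
INSERT INTO customers VALUES (1, 0), (2, 1);
INSERT INTO orders VALUES (10, 1), (11, 1), (12, 1), (13, 2);
""")

# The naive JOIN from the article: one row per ORDER, not per customer.
rows = conn.execute("""
    SELECT c.customer_id, c.churned
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
""").fetchall()

print(sorted(rows))  # [(1, 0), (1, 0), (1, 0), (2, 1)]
```

Customer 1's label now appears three times in the "training set", so a model trained on these rows sees high-activity customers three times as often.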
Mistake #1: Duplicate Rows from JOINs
If your model expects one row per customer but your query returns one row per transaction, you’ve changed the learning problem.
The issue isn’t SQL skill — it’s granularity awareness.
A better approach:
SELECT
c.customer_id,
COUNT(o.order_id) AS total_orders
FROM customers c
LEFT JOIN orders o
ON c.customer_id = o.customer_id
GROUP BY c.customer_id;
Aggregate intentionally before training.
Define the learning unit.
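Running the aggregated query against the same kind of toy data shows the difference. Again a sketch with hypothetical rows, this time including a customer with no orders to show why the LEFT JOIN matters:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
INSERT INTO customers VALUES (1), (2), (3);  -- customer 3 has no orders
INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
""")

# Aggregated query: one row per customer, zero-order customers included.
rows = conn.execute("""
    SELECT c.customer_id, COUNT(o.order_id) AS total_orders
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id
""").fetchall()

print(sorted(rows))  # [(1, 2), (2, 1), (3, 0)]
```

An inner JOIN would have dropped customer 3 entirely, and customers who never ordered are often exactly the ones a churn model needs to see.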
Mistake #2: Silent NULL Handling
NULLs rarely crash pipelines.
They quietly distort them.
SELECT income
FROM customers;
If income contains NULLs and you don't handle them deliberately, the outcome is whatever default your stack applies: rows silently dropped, values imputed behind your back, or NaNs propagated into training.
Even something simple like:
SELECT
COALESCE(income, 0) AS income
FROM customers;
forces you to define intent.
The important part isn’t the function.
It’s the decision.
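Here's why the decision matters: SQL's AVG silently skips NULLs, while COALESCE-to-zero counts them. Same column, two different numbers. A sketch with made-up income values:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, income REAL);
INSERT INTO customers VALUES (1, 50000), (2, NULL), (3, 100000);
""")

# AVG silently ignores NULLs: (50000 + 100000) / 2
skip_nulls = conn.execute("SELECT AVG(income) FROM customers").fetchone()[0]

# COALESCE treats missing income as 0: (50000 + 0 + 100000) / 3
as_zero = conn.execute(
    "SELECT AVG(COALESCE(income, 0)) FROM customers"
).fetchone()[0]

print(skip_nulls, as_zero)  # 75000.0 50000.0
```

Neither answer is wrong on its own. What's wrong is not knowing which one your pipeline is producing.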
Mistake #3: Distorted Aggregations
Global averages can hide meaningful segmentation.
SELECT AVG(transaction_amount)
FROM transactions;
It works.
It returns a number.
It feels reasonable.
But a model trained on broad aggregates may underperform in production because it lacks entity-level context.
Instead:
SELECT
customer_id,
AVG(transaction_amount) AS avg_txn
FROM transactions
GROUP BY customer_id;
Aggregation logic should reflect the model objective — not convenience.
Aggregation is feature construction.
Feature construction is model behavior.
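A toy example makes the distortion concrete. With hypothetical data for one frequent small spender and one customer who made a single large purchase, the global average describes neither of them:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (customer_id INTEGER, transaction_amount REAL);
INSERT INTO transactions VALUES
    (1, 10), (1, 10), (1, 10), (1, 10),  -- frequent, small purchases
    (2, 500);                            -- one large purchase
""")

# Global average: (4 * 10 + 500) / 5 = 108
global_avg = conn.execute(
    "SELECT AVG(transaction_amount) FROM transactions"
).fetchone()[0]

# Per-customer average: the entity-level signal the model actually needs.
per_customer = conn.execute("""
    SELECT customer_id, AVG(transaction_amount) AS avg_txn
    FROM transactions
    GROUP BY customer_id
""").fetchall()

print(global_avg)            # 108.0 -- describes neither customer
print(sorted(per_customer))  # [(1, 10.0), (2, 500.0)]
```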
The Bigger Pattern
Many ML failures blamed on “model accuracy” are actually upstream data logic issues.
Strong ML systems require strong SQL foundations.
Data pipelines are part of the model architecture — not just preprocessing.
Strong models are built on strong data contracts.
Before You Tune the Model, Ask:
Are joins intentional?
Is entity granularity clearly defined?
Are aggregations aligned with the objective?
Are NULLs handled deliberately?
Is the training dataset versioned?
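Most of these checks can be automated. One sketch, under the assumption that your training table should have exactly one row per customer (the table name `training_data` here is hypothetical): query for keys that appear more than once, and fail the pipeline if any exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE training_data (customer_id INTEGER, total_orders INTEGER);
INSERT INTO training_data VALUES (1, 3), (2, 1), (2, 1);  -- customer 2 duplicated
""")

# Granularity check: any customer_id with more than one row
# violates the one-row-per-customer contract.
dupes = conn.execute("""
    SELECT customer_id, COUNT(*) AS n
    FROM training_data
    GROUP BY customer_id
    HAVING COUNT(*) > 1
""").fetchall()

print(dupes)  # [(2, 2)] -- in a real pipeline, raise and stop training here
```

Running this as a gate before every training run turns "are joins intentional?" from a code-review question into an enforced contract.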
Sometimes the biggest ML improvement isn’t a new model.
It’s a better query.
If you’d like to see the structured breakdown with examples and commentary, I documented it here:
👉 GitHub repository:
https://github.com/brie1807/sql-to-ml-pipeline-mistakes