Vinicius Fagundes

Posted on Jun 16

SQL or Python? The line is sharper than you think (with code)

#python #scala #sql #data

Most engineers reach for Python first. Wrong instinct.

I've spent 17 years inside other people's pipelines — financial services, payments, ecommerce, gaming — and the same bug keeps showing up wearing two different costumes. Either someone wrote a Python loop doing what a single window function solves in milliseconds, or they buried a procedural decision tree inside a 60-line CASE statement that no one can test or read.

Same root cause. Wrong language for the job.

So let me give you the rule I actually teach, and then earn it with code — the cases where Python turns simple SQL into a nightmare, and the cases where SQL turns simple Python into one.

The one-line rule

If you're declaring what data you want, use SQL. If you're describing how to transform it, step by step, use Python.

SQL is declarative and set-at-once. You describe the shape of the result and the engine figures out the execution plan — joins, memory, spill to disk, parallelism across cores. You don't manage any of that.

Python is imperative and row-by-row. You control the algorithm and you can reach the outside world — an API, a model, a file on S3. But you also own the memory, the ordering, and the state. Nobody parallelizes it for you.

That's the whole fork. Everything below is just consequences.

Part 1 — When SQL is trivial and Python fights you

These are the set operations. Joins, aggregations, ranking, windows. The moment you find yourself writing a loop to do one of these, stop.

1. "Give me the latest row per customer"

Dedup-to-latest is the single most common request I see done badly.

-- Snowflake / BigQuery / Databricks / DuckDB
SELECT *
FROM orders
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY customer_id
    ORDER BY updated_at DESC
) = 1;

One window function. QUALIFY filters on the window result directly — no subquery, no extra CTE. (On Postgres/MySQL/SQL Server, which don't have QUALIFY yet, you wrap the ROW_NUMBER() in a CTE and filter WHERE rn = 1. Same idea, one more line.)

Now Python, no pandas:

latest = {}
for row in orders:
    cid = row["customer_id"]
    current = latest.get(cid)
    if current is None or row["updated_at"] > current["updated_at"]:
        latest[cid] = row

result = list(latest.values())

Honestly? That's fine for one key and one tie-breaker. It's readable. But watch what happens when reality shows up: now you dedup on (customer_id, region) and the tie-break is updated_at DESC, id DESC. In SQL you add two words to the ORDER BY. In Python you're now writing comparison branches by hand, and that's where the bugs move in.

The SQL doesn't get more complex as the rule gets more complex. The Python does.

2. Moving averages and running totals

Trailing average per account, last 7 rows:

SELECT
    account_id,
    event_date,
    amount,
    AVG(amount) OVER (
        PARTITION BY account_id
        ORDER BY event_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS trailing_avg
FROM transactions;

One frame spec. (For a true time-based window — "last 7 calendar days" — you use RANGE ... INTERVAL where the dialect supports it; ROWS counts rows, not days.)

The pure-Python version has to sort, group by partition, and hand-roll a sliding window:

from collections import defaultdict, deque

by_account = defaultdict(list)
for row in sorted(transactions, key=lambda r: (r["account_id"], r["event_date"])):
    by_account[row["account_id"]].append(row)

results = []
for account_id, rows in by_account.items():
    window = deque(maxlen=7)          # auto-evicts the oldest
    for row in rows:
        window.append(row["amount"])
        avg = sum(window) / len(window)   # len() handles the partial window at the start
        results.append({**row, "trailing_avg": avg})

It works. But look at everything you just took ownership of: the sort, the partitioning, the eviction, the partial-window edge case at the start of each group. The SQL engine did all of that for free and ran every partition in parallel and spilled to disk if the data didn't fit in memory. Your Python loaded the whole thing into one machine's RAM and ran single-threaded.

3. Sessionization — the gaps-and-islands problem

Group events into sessions, where a gap over 30 minutes starts a new one. This is the example I use in class because it looks scary in SQL and isn't.

WITH marked AS (
    SELECT
        user_id,
        event_time,
        CASE
            WHEN DATEDIFF('second',
                          LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time),
                          event_time) > 1800
              OR LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time) IS NULL
            THEN 1 ELSE 0
        END AS is_new_session
    FROM events
),
numbered AS (
    SELECT
        user_id,
        event_time,
        SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY event_time) AS session_id
    FROM marked
)
SELECT
    user_id,
    session_id,
    MIN(event_time) AS session_start,
    MAX(event_time) AS session_end,
    COUNT(*)        AS event_count
FROM numbered
GROUP BY user_id, session_id;

(The DATEDIFF signature is the part that moves between dialects: Snowflake DATEDIFF('second', a, b), Databricks timestampdiff(SECOND, a, b), Postgres EXTRACT(EPOCH FROM b - a).)

The trick is the running SUM over a 0/1 flag — every time a new session starts, the cumulative sum ticks up, and that number becomes the session id. No loop. The engine does it across every user at once.

Here's the Python:

from collections import defaultdict

events_by_user = defaultdict(list)
for e in events:
    events_by_user[e["user_id"]].append(e)

sessions = []
for user_id, evs in events_by_user.items():
    evs.sort(key=lambda e: e["event_time"])
    prev = None
    current = None
    for e in evs:
        if prev is None or (e["event_time"] - prev).total_seconds() > 1800:
            current = {
                "user_id": user_id,
                "session_start": e["event_time"],
                "session_end": e["event_time"],
                "event_count": 0,
            }
            sessions.append(current)
        current["session_end"] = e["event_time"]
        current["event_count"] += 1
        prev = e["event_time"]

And now I'll say the thing most "SQL vs Python" posts won't: this Python is arguably more readable than that two-CTE SQL, if you've never seen the gaps-and-islands pattern. So readability isn't the argument here. Scale and locality are.

The SQL runs next to the data, parallel across users, spilling to disk when it overflows. The Python pulls every event into one process and runs it on one core. At a million rows you won't notice. At two billion, the Python is a 3 a.m. page and the SQL is a coffee break. That difference — where the work physically runs — is the real reason to keep set operations in SQL, not the line count.

Part 2 — When Python is trivial and SQL fights you

Flip it. The moment your logic needs the outside world, branches hard, or depends on a value it just computed, you've left set-operation territory. Forcing it back into SQL is how you get the unreadable monsters.

4. Enriching each row from an API or a model

import requests

def enrich(rows):
    for row in rows:
        resp = requests.get(
            "https://api.example.com/company",
            params={"domain": row["email_domain"]},
            timeout=5,
        )
        data = resp.json() if resp.ok else {}
        row["company_size"] = data.get("employees")
        row["industry"] = data.get("industry")
    return rows

SQL has no clean answer for this. You can wrap the call in a UDF, but now you've smuggled a network round-trip into a warehouse scan and you pay for it on every row, every run, with no easy retry or backoff. The same is true for running a model per row, or reading a file per record. The instant a row needs to talk to something outside the database, the database is the wrong layer.

5. Many independent business rules

Shipping fee: domestic vs international, member discount, express surcharge, weekend surcharge, an oversized-international case that needs a manual quote. Five rules that combine. In SQL this becomes a CASE tree, and the trap is that the rules multiply:

SELECT
    order_id,
    CASE
        WHEN country NOT IN ('BR','US','CA') AND weight_kg > 30 THEN NULL  -- manual quote
        WHEN is_member AND express     THEN base_rate * 0.9 + 12 + 0.5 * weight_kg
        WHEN is_member AND NOT express THEN base_rate * 0.9
        WHEN express                   THEN base_rate + 12 + 0.5 * weight_kg
        ELSE base_rate
    END AS shipping_fee   -- now add the weekend surcharge to FOUR of these branches...
FROM orders;

Every new independent rule doubles the branch count, and a rule like "weekend surcharge" has to be pasted into several branches at once. That's a bug farm.

In Python the rules stay independent and additive:

DOMESTIC = {"BR", "US", "CA"}
WEEKEND_SURCHARGE = 8.0

def shipping_fee(order):
    if order["country"] not in DOMESTIC and order["weight_kg"] > 30:
        return None  # oversized international -> manual quote

    fee = order["base_rate"]
    if order["is_member"]:
        fee *= 0.9
    if order["express"]:
        fee += 12 + 0.5 * order["weight_kg"]
    if order["created_at"].weekday() >= 5:   # Saturday or Sunday
        fee += WEEKEND_SURCHARGE
    return round(fee, 2)

Each rule is one line and doesn't touch the others. And here's the part that settles it for me: you can unit-test this.

def test_member_express_weekend():
    order = {"country": "BR", "weight_kg": 2, "base_rate": 100,
             "is_member": True, "express": True,
             "created_at": datetime(2026, 6, 13, 10, 0)}  # a Saturday
    assert shipping_fee(order) == 100 * 0.9 + 12 + 1 + 8

You can't pytest a CASE expression. When the logic is the product, that's not a small thing.

6. When row N depends on the computed value of row N-1

This is the cleanest boundary in the whole post.

Running balance, floored at zero — the balance can never go negative, and a withdrawal that would overdraw just lands you at zero. The obvious SQL is a cumulative sum:

SELECT ts, amount,
       SUM(amount) OVER (ORDER BY ts) AS balance   -- WRONG once there's a floor
FROM ledger;

It's wrong, and it's wrong in an interesting way. A plain window SUM adds up the raw inputs. But the floor makes each step depend on the clamped result of the previous step, not the raw number. Window functions can see the prior row's input; they can't see their own prior output. So the moment the computation feeds itself, the window function can't express it.

Pure SQL forces you into a recursive CTE:

WITH ordered AS (
    SELECT ts, amount, ROW_NUMBER() OVER (ORDER BY ts) AS rn FROM ledger
),
walk AS (
    SELECT rn, ts, amount, GREATEST(amount, 0) AS balance
    FROM ordered WHERE rn = 1
    UNION ALL
    SELECT o.rn, o.ts, o.amount, GREATEST(w.balance + o.amount, 0) AS balance
    FROM walk w
    JOIN ordered o ON o.rn = w.rn + 1
)
SELECT ts, amount, balance FROM walk ORDER BY ts;

That's correct. It also processes one row per recursion step — no parallelism — and on a few million rows it gets slow or trips the recursion limit.

The Python is three lines:

balance = 0.0
out = []
for tx in sorted(ledger, key=lambda t: t["ts"]):
    balance = max(balance + tx["amount"], 0.0)
    out.append({**tx, "balance": balance})

The principle: when each row depends on a value you computed for the row before — not its raw input, the computed result — you're doing iteration, not set reduction. SQL can technically limp through it. It's still the wrong tool.

The cheat sheet

What the problem looks like	Reach for
Join / filter / aggregate across many rows	SQL
Rank, dedup, "latest per group"	SQL — window + `QUALIFY`
Running totals, moving averages, sessionization	SQL — window functions
You're declaring what you want	SQL
Per-row call to an API, a model, or a file	Python
Many independent business rules	Python
Logic that needs to be unit-tested in isolation	Python
Row N depends on the computed value of row N-1	Python
You're describing how to transform it, step by step	Python

How I actually decide in production

Push the set reduction as far upstream as it'll go. Join, filter, and aggregate in SQL — in the warehouse, in dbt — where it's parallel and cheap and the optimizer does the hard part. Shrink the data to the smallest correct set before it ever reaches Python. Then drop into Python only for the procedural tail: the enrichment, the branching, the model call, the stateful walk.

The internet will tell you one of these is "faster." That's the lazy framing. They're not competing — they're a relay. The skill is knowing exactly where the baton changes hands.

One number to make it concrete. A pipeline I inherited did its core join as a Python row-loop — nested iteration over data the warehouse could have joined in its sleep. Rewritten as a single SQL join plus one window pass, same data, same cluster, it dropped from 8 hours to 47 minutes. Nothing got smarter. The work just moved to the layer it belonged in.

So before you write the loop, ask the only question that matters: am I declaring what I want, or describing how to build it? Get that right and the language picks itself.

What's the worst case you've seen — a Python loop that should've been one query, or a CASE statement that should've been a function? I want the war story.

I'm Vinicius Fagundes — principal data engineer, independent, and an MBA lecturer in São Paulo. I rebuild slow, expensive, and fragile data pipelines for a living. If your stack looks like any of the above, this is the work I do at vf-insights.com.

DEV Community