De' Clerke

Posted on Jun 2

SQL Interview Problems for Data Engineers: 30 Patterns That Actually Get Asked

#sql #dataengineering #career #interview

Data engineering SQL interviews are not the same as software engineering SQL interviews. You will not be asked to find the second-highest salary or reverse a string. You will be asked to sessionize user events, detect data quality issues in a pipeline, build a cohort retention table, or find consecutive streaks in time-series data. The problems are more complex, the schemas are messier, and the expectation is that you have encountered these patterns in production, not just on LeetCode.

I built these 30 problems from the SQL patterns I use repeatedly across my pipeline projects: NSE equity data, Kenyan property listings, job postings, flight data, and financial transactions. Every problem here maps to something I have written in production.

The problems use six tables throughout. Memorise the schemas before any interview:

nse_trades    (trade_id, ticker, sector, trade_date, open_price,
               close_price, volume, market_cap)

flights       (flight_id, airline, origin, destination,
               departure_time, arrival_time, status, passengers)

listings      (listing_id, location, property_type, price,
               bedrooms, listed_date, sold_date, status)

transactions  (transaction_id, user_id, amount, category,
               created_at, status)  -- status: completed/failed/pending

employees     (employee_id, name, department, salary,
               hire_date, manager_id)

events        (event_id, user_id, event_type, created_at)

Easy: The Foundations

Problem 1: Top N Per Group

Question: Find the top 3 most traded stocks by total volume for each sector. Return sector, ticker, total_volume. Order by sector, then total_volume DESC.

WITH ranked AS (
    SELECT
        sector,
        ticker,
        SUM(volume)                                                      AS total_volume,
        RANK() OVER (PARTITION BY sector ORDER BY SUM(volume) DESC)      AS rnk
    FROM nse_trades
    GROUP BY sector, ticker
)
SELECT sector, ticker, total_volume
FROM ranked
WHERE rnk <= 3
ORDER BY sector, total_volume DESC;

What the interviewer is testing: Whether you know that window functions run after GROUP BY. You aggregate first to get total_volume per ticker, then rank within each sector. The common mistake is trying to apply the window function before grouping, which does not work because window functions are evaluated after GROUP BY in the SQL logical processing order.

Follow-up you will get: "What is the difference between RANK and DENSE_RANK?"

RANK leaves gaps after ties: 1, 1, 3, 4. DENSE_RANK does not: 1, 1, 2, 3. Use DENSE_RANK when you genuinely want "top 3 positions" even when multiple tickers tie for second place. Use RANK when you want the literal third-most-traded stock.

Problem 2: HAVING vs WHERE

Question: Find sectors where the average closing price across all trades is above 50. Show the count of distinct tickers per sector. Return sector, avg_close, ticker_count. Order by avg_close DESC.

SELECT
    sector,
    ROUND(AVG(close_price), 2)  AS avg_close,
    COUNT(DISTINCT ticker)       AS ticker_count
FROM nse_trades
GROUP BY sector
HAVING AVG(close_price) > 50
ORDER BY avg_close DESC;

What the interviewer is testing: The difference between WHERE and HAVING. WHERE filters rows before aggregation. HAVING filters groups after aggregation. You cannot use WHERE close_price > 50 here because that filters individual rows, not sector averages. Any interviewer asking a problem with an aggregate filter condition is testing whether you reach for HAVING.

Problem 3: Self Join

Question: List every employee with their manager's name. Include employees with no manager (the CEO/top level). Return employee_name, department, salary, manager_name.

SELECT
    e.name       AS employee_name,
    e.department,
    e.salary,
    m.name       AS manager_name
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.employee_id
ORDER BY e.department, e.name;

What the interviewer is testing: Self joins and the critical choice of LEFT JOIN vs INNER JOIN. Top-level employees have NULL in manager_id. An INNER JOIN silently drops them. LEFT JOIN keeps them with NULL in manager_name, which is correct.

Follow-up you will get: "How do you find employees who earn more than their manager?"

SELECT e.name, e.salary, m.name AS manager, m.salary AS manager_salary
FROM employees e
JOIN employees m ON e.manager_id = m.employee_id
WHERE e.salary > m.salary;

Here you use INNER JOIN because you only want employees who have a manager to compare against.

Problem 4: NULL Counting

Question: The listings table has NULL in sold_date for unsold properties. Calculate total listings, sold listings, unsold listings, and the sell-through rate as a percentage. Return one row.

SELECT
    COUNT(*)                                                    AS total_listings,
    COUNT(sold_date)                                            AS sold_listings,
    COUNT(*) - COUNT(sold_date)                                 AS unsold_listings,
    ROUND(COUNT(sold_date)::NUMERIC / COUNT(*) * 100, 1)        AS sell_through_pct
FROM listings;

What the interviewer is testing: The fundamental difference between COUNT(*) and COUNT(column). COUNT(*) counts all rows including those with NULLs. COUNT(column) counts only non-NULL values in that specific column. This lets you use COUNT(sold_date) as a conditional count without writing CASE WHEN sold_date IS NOT NULL THEN 1 END. In the Kenya Real Estate project, this exact pattern was how I tracked listing status across 1,338 properties.

Problem 5: Conditional Aggregation (Pivot)

Question: For each user in the transactions table, show how much they spent in each category: food, transport, utilities. Return user_id, food_total, transport_total, utilities_total.

SELECT
    user_id,
    SUM(CASE WHEN category = 'food'      THEN amount ELSE 0 END) AS food_total,
    SUM(CASE WHEN category = 'transport' THEN amount ELSE 0 END) AS transport_total,
    SUM(CASE WHEN category = 'utilities' THEN amount ELSE 0 END) AS utilities_total
FROM transactions
WHERE status = 'completed'
GROUP BY user_id
ORDER BY user_id;

What the interviewer is testing: Conditional aggregation, which is the SQL pivot pattern. PostgreSQL does not have a native PIVOT keyword. SUM(CASE WHEN ...) is the standard approach. This pattern appears in nearly every analytics interview. Know it cold.

Problem 6: Date Arithmetic

Question: For each completed flight, calculate the duration in minutes and flag flights over 180 minutes as 'long_haul'. Return flight_id, airline, origin, destination, duration_minutes, flight_type.

SELECT
    flight_id,
    airline,
    origin,
    destination,
    EXTRACT(EPOCH FROM (arrival_time - departure_time)) / 60   AS duration_minutes,
    CASE
        WHEN EXTRACT(EPOCH FROM (arrival_time - departure_time)) / 60 > 180
        THEN 'long_haul' ELSE 'short_haul'
    END                                                         AS flight_type
FROM flights
WHERE status = 'completed';

What the interviewer is testing: Timestamp arithmetic. Subtracting two TIMESTAMP values in PostgreSQL gives an INTERVAL. EXTRACT(EPOCH FROM interval) converts it to seconds. Divide by 60 for minutes. In BigQuery and Snowflake, you would use TIMESTAMP_DIFF(arrival_time, departure_time, MINUTE). Know both forms.

Problem 7: Deduplication

Question: The nse_trades table has duplicates due to a loading bug. Keep only the row with the highest volume per ticker + trade_date. Write a DELETE.

WITH ranked AS (
    SELECT trade_id,
        ROW_NUMBER() OVER (
            PARTITION BY ticker, trade_date
            ORDER BY volume DESC
        ) AS rn
    FROM nse_trades
)
DELETE FROM nse_trades
WHERE trade_id IN (SELECT trade_id FROM ranked WHERE rn > 1);

What the interviewer is testing: Using a CTE inside a DELETE to isolate the rows to remove. The ROW_NUMBER() approach is portable across PostgreSQL, BigQuery, and Snowflake. PostgreSQL also supports DISTINCT ON for this, but the ROW_NUMBER() version works everywhere. I used this exact pattern to clean the NSE trade dataset before loading it into the analytics schema.

Medium: The Patterns That Separate Candidates

Problem 8: Running Total and LAG

Question: For ticker 'SCOM', show each trade date with the closing price, the daily change in price, and the cumulative volume from the start of the year.

SELECT
    trade_date,
    close_price,
    close_price - LAG(close_price) OVER (ORDER BY trade_date)    AS daily_change,
    SUM(volume) OVER (
        ORDER BY trade_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    )                                                             AS cumulative_volume
FROM nse_trades
WHERE ticker = 'SCOM'
  AND EXTRACT(YEAR FROM trade_date) = 2025
ORDER BY trade_date;

What the interviewer is testing: Two window function patterns together. LAG(col) fetches the value from the previous row. The ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame on SUM() creates a running total. The frame clause is explicit here because the default frame for ordered windows (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) can give unexpected results when there are ties in the ORDER BY column.

Problem 9: Moving Average with Partial Window Filtering

Question: Calculate a 7-day moving average of closing price for each ticker. Only show rows where a full 7-day window exists. Return ticker, trade_date, close_price, ma_7d.

WITH with_ma AS (
    SELECT
        ticker,
        trade_date,
        close_price,
        AVG(close_price) OVER (
            PARTITION BY ticker
            ORDER BY trade_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        )                                           AS ma_7d,
        COUNT(*) OVER (
            PARTITION BY ticker
            ORDER BY trade_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        )                                           AS window_size
    FROM nse_trades
)
SELECT ticker, trade_date, close_price, ROUND(ma_7d, 2) AS ma_7d
FROM with_ma
WHERE window_size = 7
ORDER BY ticker, trade_date;

What the interviewer is testing: Handling partial windows at the start of a time series. ROWS BETWEEN 6 PRECEDING AND CURRENT ROW includes the current row plus the 6 preceding. For the first 6 rows per ticker, the window has fewer than 7 rows, so the average is based on incomplete data. The COUNT(*) OVER trick with the same frame detects partial windows cleanly.

Problem 10: Year-Over-Year Comparison

Question: For each ticker, compare the average monthly closing price in 2025 vs 2024. Show the absolute difference and percentage change.

WITH monthly AS (
    SELECT
        ticker,
        EXTRACT(MONTH FROM trade_date)    AS month,
        EXTRACT(YEAR  FROM trade_date)    AS year,
        ROUND(AVG(close_price), 2)        AS avg_close
    FROM nse_trades
    WHERE EXTRACT(YEAR FROM trade_date) IN (2024, 2025)
    GROUP BY ticker,
             EXTRACT(MONTH FROM trade_date),
             EXTRACT(YEAR  FROM trade_date)
)
SELECT
    a.ticker,
    a.month,
    a.avg_close                                          AS avg_2024,
    b.avg_close                                          AS avg_2025,
    ROUND(b.avg_close - a.avg_close, 2)                  AS abs_change,
    ROUND((b.avg_close - a.avg_close)
          / NULLIF(a.avg_close, 0) * 100, 1)             AS pct_change
FROM monthly a
JOIN monthly b ON a.ticker = b.ticker AND a.month = b.month
WHERE a.year = 2024 AND b.year = 2025
ORDER BY a.ticker, a.month;

What the interviewer is testing: Self-joining a CTE on different year values. NULLIF(a.avg_close, 0) prevents a division-by-zero error if any 2024 average is zero, returning NULL instead of crashing. Always use NULLIF when dividing by user data.

Problem 11: Cohort Retention

Question: For each user, identify their first transaction month as their cohort. Then for each subsequent month, show how many users from that cohort were still active.

WITH first_txn AS (
    SELECT
        user_id,
        DATE_TRUNC('month', MIN(created_at)) AS cohort_month
    FROM transactions
    WHERE status = 'completed'
    GROUP BY user_id
),
monthly_activity AS (
    SELECT DISTINCT
        t.user_id,
        DATE_TRUNC('month', t.created_at) AS activity_month
    FROM transactions t
    WHERE t.status = 'completed'
)
SELECT
    f.cohort_month,
    EXTRACT(MONTH FROM AGE(m.activity_month, f.cohort_month)) AS months_since_start,
    COUNT(DISTINCT m.user_id)                                  AS active_users
FROM first_txn f
JOIN monthly_activity m ON f.user_id = m.user_id
GROUP BY f.cohort_month, m.activity_month
ORDER BY f.cohort_month, months_since_start;

What the interviewer is testing: Cohort analysis. The pattern is always the same: find each user's first event, then join back to all subsequent activity. DATE_TRUNC('month', ...) collapses timestamps to month granularity. AGE(later, earlier) gives the interval between months; EXTRACT(MONTH FROM ...) pulls the number. This is one of the most commonly asked product analytics problems in data engineering interviews.

Problem 12: Funnel Analysis

Question: The events table tracks 'page_view', 'add_to_cart', 'checkout', and 'purchase'. Find how many users reached each stage. A user counts at a stage only if they completed all prior stages.

WITH stages AS (
    SELECT user_id,
        MAX(CASE WHEN event_type = 'page_view'   THEN 1 ELSE 0 END) AS did_view,
        MAX(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END) AS did_cart,
        MAX(CASE WHEN event_type = 'checkout'    THEN 1 ELSE 0 END) AS did_checkout,
        MAX(CASE WHEN event_type = 'purchase'    THEN 1 ELSE 0 END) AS did_purchase
    FROM events
    GROUP BY user_id
),
funnel AS (
    SELECT
        SUM(did_view)                                                    AS views,
        SUM(CASE WHEN did_view=1 AND did_cart=1 THEN 1 END)             AS carts,
        SUM(CASE WHEN did_view=1 AND did_cart=1
                  AND did_checkout=1 THEN 1 END)                        AS checkouts,
        SUM(CASE WHEN did_view=1 AND did_cart=1
                  AND did_checkout=1 AND did_purchase=1 THEN 1 END)     AS purchases
    FROM stages
)
SELECT
    stage,
    users_reached,
    ROUND(users_reached::NUMERIC / LAG(users_reached) OVER (ORDER BY step) * 100, 1)
        AS conversion_from_prev
FROM (
    SELECT 1 AS step, 'page_view'   AS stage, views     AS users_reached FROM funnel
    UNION ALL
    SELECT 2,         'add_to_cart',           carts    FROM funnel
    UNION ALL
    SELECT 3,         'checkout',              checkouts FROM funnel
    UNION ALL
    SELECT 4,         'purchase',              purchases FROM funnel
) f
ORDER BY step;

What the interviewer is testing: Two things. First, MAX(CASE WHEN ...) as a flag-per-user pivot. Second, UNION ALL plus LAG() to compute step-to-step conversion rates. The funnel is an ordered set of stages, so LAG() applied to the ordered output gives conversion from the previous step.

Problem 13: Consecutive Streaks (The Islands Pattern)

Question: For each ticker, find the longest consecutive streak of days where the closing price increased.

WITH daily_change AS (
    SELECT
        ticker,
        trade_date,
        CASE WHEN close_price > LAG(close_price) OVER (PARTITION BY ticker ORDER BY trade_date)
             THEN 1 ELSE 0 END AS is_up
    FROM nse_trades
),
streak_groups AS (
    SELECT
        ticker,
        trade_date,
        is_up,
        ROW_NUMBER() OVER (PARTITION BY ticker ORDER BY trade_date)
        - ROW_NUMBER() OVER (PARTITION BY ticker, is_up ORDER BY trade_date) AS grp
    FROM daily_change
)
SELECT
    ticker,
    MAX(streak_len) AS longest_streak
FROM (
    SELECT ticker, grp, COUNT(*) AS streak_len
    FROM streak_groups
    WHERE is_up = 1
    GROUP BY ticker, grp
) streaks
GROUP BY ticker
ORDER BY longest_streak DESC;

What the interviewer is testing: The islands technique. It is one of the most consistently asked hard SQL patterns. The trick is subtracting two ROW_NUMBER() values: one ordered overall, one ordered within each group. For consecutive equal values, this subtraction stays constant, creating a unique group ID for each run. When the value changes, the subtraction shifts. Memorise this pattern. It comes up for streaks, sessions, consecutive activity, and any "how long did this condition hold" question.

Problem 14: LAST_VALUE Frame Clause Gotcha

Question: For each ticker, show the opening price on the first trading day of each month and the closing price on the last trading day of the month.

WITH monthly_bounds AS (
    SELECT
        ticker,
        DATE_TRUNC('month', trade_date)   AS year_month,
        FIRST_VALUE(open_price) OVER (
            PARTITION BY ticker, DATE_TRUNC('month', trade_date)
            ORDER BY trade_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        )                                  AS month_open,
        LAST_VALUE(close_price) OVER (
            PARTITION BY ticker, DATE_TRUNC('month', trade_date)
            ORDER BY trade_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        )                                  AS month_close
    FROM nse_trades
)
SELECT DISTINCT
    ticker,
    year_month,
    month_open,
    month_close,
    ROUND((month_close - month_open) / NULLIF(month_open, 0) * 100, 2) AS monthly_return_pct
FROM monthly_bounds
ORDER BY ticker, year_month;

What the interviewer is testing: The LAST_VALUE frame clause trap. Without ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, LAST_VALUE only looks at the current row, because the default frame for an ordered window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This is the single most common window function mistake I see. FIRST_VALUE works correctly with the default frame because it looks back to the beginning. LAST_VALUE does not, because the default frame stops at the current row, not the end of the partition.

Problem 15: Gap Detection in Time-Series

Question: Find all date gaps for ticker 'EQTY' in 2025. The table should have an entry for every weekday.

WITH trading_days AS (
    SELECT gs.dt::DATE AS expected_date
    FROM generate_series(
        '2025-01-01'::DATE,
        '2025-12-31'::DATE,
        '1 day'::INTERVAL
    ) AS gs(dt)
    WHERE EXTRACT(DOW FROM gs.dt) NOT IN (0, 6)
),
actual_days AS (
    SELECT DISTINCT trade_date FROM nse_trades WHERE ticker = 'EQTY'
)
SELECT t.expected_date AS missing_date
FROM trading_days t
LEFT JOIN actual_days a ON t.expected_date = a.trade_date
WHERE a.trade_date IS NULL
ORDER BY t.expected_date;

What the interviewer is testing: generate_series() for sequence generation combined with the LEFT JOIN + WHERE NULL pattern for gap detection. This is a core data quality pattern. In BigQuery, use GENERATE_DATE_ARRAY('2025-01-01', '2025-12-31', INTERVAL 1 DAY). In Snowflake, use GENERATOR(ROWCOUNT => 365) with DATEADD. Know how to generate a date spine on your target platform.

Problem 16: Sessionization

Question: Group user events into sessions. A session ends when there is more than 30 minutes of inactivity. For each session, calculate session start, end, duration in minutes, and event count.

WITH lagged AS (
    SELECT
        user_id,
        created_at,
        LAG(created_at) OVER (PARTITION BY user_id ORDER BY created_at) AS prev_event_time
    FROM events
),
session_flags AS (
    SELECT *,
        CASE
            WHEN prev_event_time IS NULL
              OR created_at - prev_event_time > INTERVAL '30 minutes'
            THEN 1 ELSE 0
        END AS is_session_start
    FROM lagged
),
sessions AS (
    SELECT *,
        SUM(is_session_start) OVER (PARTITION BY user_id ORDER BY created_at) AS session_id
    FROM session_flags
)
SELECT
    user_id,
    session_id,
    MIN(created_at)                                                              AS session_start,
    MAX(created_at)                                                              AS session_end,
    ROUND(EXTRACT(EPOCH FROM MAX(created_at) - MIN(created_at)) / 60, 1)        AS duration_minutes,
    COUNT(*)                                                                     AS event_count
FROM sessions
GROUP BY user_id, session_id
ORDER BY user_id, session_start;

What the interviewer is testing: The sessionization pattern, which is the same islands technique applied to time gaps instead of value changes. Detect the gap with LAG(), flag session starts with a CASE, then take a running SUM() of those flags to create a session ID. Every event within the same session increments its session ID by zero; a new session increments it by one. This pattern appears in almost every streaming and product analytics interview.

Hard: What Separates Senior Candidates

Problem 17: Recursive CTE for Hierarchies

Question: Find the full management chain from every employee up to the CEO. Return employee_id, name, level (0 = CEO), and path.

WITH RECURSIVE hierarchy AS (
    -- Base case: CEO has no manager
    SELECT
        employee_id, name, manager_id,
        0          AS level,
        name::TEXT AS path
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive case: employees who report into the hierarchy
    SELECT
        e.employee_id, e.name, e.manager_id,
        h.level + 1,
        h.path || ' > ' || e.name
    FROM employees e
    JOIN hierarchy h ON e.manager_id = h.employee_id
)
SELECT employee_id, name, level, path
FROM hierarchy
ORDER BY path;

What the interviewer is testing: Recursive CTEs. The two parts joined by UNION ALL: the base case (anchor) that returns the starting rows, and the recursive case that references the CTE itself to add one more level per iteration. The recursion terminates when no rows match the JOIN, meaning there are no more employees to add. PostgreSQL, BigQuery, Snowflake, and Redshift all support WITH RECURSIVE.

Problem 18: Volume-Weighted Average Price with Ranking

Question: Calculate VWAP per ticker per month. Rank tickers by VWAP within each sector per month.

WITH monthly_vwap AS (
    SELECT
        DATE_TRUNC('month', trade_date)     AS month,
        sector,
        ticker,
        SUM(close_price * volume)::NUMERIC
        / NULLIF(SUM(volume), 0)            AS vwap
    FROM nse_trades
    GROUP BY DATE_TRUNC('month', trade_date), sector, ticker
)
SELECT
    month,
    sector,
    ticker,
    ROUND(vwap, 4)                                                             AS vwap,
    RANK() OVER (PARTITION BY month, sector ORDER BY vwap DESC)               AS sector_rank
FROM monthly_vwap
ORDER BY month, sector, sector_rank;

What the interviewer is testing: Weighted averages and multi-level window partitioning. VWAP is SUM(price * volume) / SUM(volume). It gives the true average price weighted by how much was traded at each level. The RANK() OVER (PARTITION BY month, sector ...) applies ranking within each month-sector combination. Interviewers in fintech, trading, and analytics roles ask VWAP variants constantly.

Problem 19: Running Balance with Signed Transactions

Question: Calculate each user's running account balance. Completed transactions add. Completed reversals subtract. Pending transactions have no effect.

WITH adjusted AS (
    SELECT
        user_id,
        created_at,
        CASE
            WHEN status = 'completed' AND category != 'reversal' THEN  amount
            WHEN status = 'completed' AND category = 'reversal'  THEN -amount
            ELSE 0
        END AS net_amount
    FROM transactions
)
SELECT
    user_id,
    created_at,
    net_amount,
    SUM(net_amount) OVER (
        PARTITION BY user_id
        ORDER BY created_at
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_balance
FROM adjusted
WHERE net_amount != 0
ORDER BY user_id, created_at;

What the interviewer is testing: Transform first, then window. Convert rows into signed amounts in a CTE before applying any running total. Mixing the sign logic inside the window function makes the query both unreadable and error-prone. This pattern underpins any financial ledger query. In LedgerSync, I applied this exact structure to 1.5 million BOOST transaction rows.

Problem 20: SCD Type 2 Change Detection

Question: Given a daily snapshot table listing_snapshots (listing_id, price, status, snapshot_date), find which listings had their price change between consecutive snapshots.

WITH with_prev AS (
    SELECT
        listing_id,
        snapshot_date,
        price                                                AS new_price,
        LAG(price) OVER (
            PARTITION BY listing_id
            ORDER BY snapshot_date
        )                                                    AS old_price
    FROM listing_snapshots
)
SELECT
    listing_id,
    snapshot_date,
    old_price,
    new_price,
    new_price - old_price AS price_change
FROM with_prev
WHERE old_price IS NOT NULL
  AND new_price != old_price
ORDER BY listing_id, snapshot_date;

What the interviewer is testing: CDC detection logic. LAG() within PARTITION BY listing_id compares each row to the previous snapshot for the same listing. WHERE old_price IS NOT NULL excludes the first snapshot for each listing, which has no previous value. WHERE new_price != old_price keeps only rows where the price actually changed. This is the exact pattern dbt uses in SCD Type 2 snapshot models to detect which records to close.

Problem 21: Pipeline Quality Checks in SQL

Question: Your pipeline loads daily NSE data. Write a query that verifies: no duplicate trade_ids, row count between 50 and 500, no NULLs in ticker/close_price/volume, and all prices positive. Return one row per check with status (PASS/FAIL) and detail.

WITH today_data AS (
    SELECT * FROM nse_trades
    WHERE trade_date = CURRENT_DATE
)
SELECT check_name, status, detail
FROM (
    -- Check 1: No duplicate trade_ids
    SELECT
        'no_duplicate_trade_ids'                                       AS check_name,
        CASE WHEN COUNT(*) = COUNT(DISTINCT trade_id)
             THEN 'PASS' ELSE 'FAIL' END                               AS status,
        'Duplicates: ' || (COUNT(*) - COUNT(DISTINCT trade_id))::TEXT  AS detail
    FROM today_data

    UNION ALL

    -- Check 2: Row count in expected range
    SELECT
        'row_count_in_range',
        CASE WHEN COUNT(*) BETWEEN 50 AND 500 THEN 'PASS' ELSE 'FAIL' END,
        'Row count: ' || COUNT(*)::TEXT
    FROM today_data

    UNION ALL

    -- Check 3: No NULLs in critical fields
    SELECT
        'no_nulls_in_critical_fields',
        CASE WHEN COUNT(*) FILTER (
            WHERE ticker IS NULL OR close_price IS NULL OR volume IS NULL
        ) = 0 THEN 'PASS' ELSE 'FAIL' END,
        'Null rows: ' || COUNT(*) FILTER (
            WHERE ticker IS NULL OR close_price IS NULL OR volume IS NULL
        )::TEXT
    FROM today_data

    UNION ALL

    -- Check 4: All prices positive
    SELECT
        'all_prices_positive',
        CASE WHEN MIN(close_price) > 0 THEN 'PASS' ELSE 'FAIL' END,
        'Min close_price: ' || COALESCE(MIN(close_price)::TEXT, 'NULL')
    FROM today_data
) checks
ORDER BY check_name;

What the interviewer is testing: Whether you think like a data engineer, not just a SQL writer. This is the native SQL equivalent of a Great Expectations test suite. COUNT(*) FILTER (WHERE ...) is PostgreSQL syntax for conditional counting without CASE WHEN. The UNION ALL pattern assembles multiple independent checks into a single result set. A candidate who writes something like this in an interview signals production pipeline experience.

The Pattern Reference

Every problem above maps to one of these recurring patterns. Knowing the right tool for each type of question is what makes the difference in a live interview.

Pattern	Tool
Top N per group	`RANK() OVER (PARTITION BY ... ORDER BY ...)`
Running total	`SUM() OVER (ORDER BY ... ROWS UNBOUNDED PRECEDING)`
Moving average	`AVG() OVER (... ROWS BETWEEN 6 PRECEDING AND CURRENT)`
Previous row value	`LAG(col) OVER (PARTITION BY ... ORDER BY ...)`
Next row value	`LEAD(col) OVER (PARTITION BY ... ORDER BY ...)`
First/last in group	`FIRST_VALUE` / `LAST_VALUE` (always specify the frame)
Consecutive streak	`ROW_NUMBER()` minus `ROW_NUMBER() OVER group` (islands)
Session detection	`LAG` + gap flag + running `SUM` of flags
Median	`PERCENTILE_CONT(0.5)` or `ROW_NUMBER` trick
Weighted average	`SUM(val * weight) / SUM(weight)`
Pivot (long to wide)	`SUM(CASE WHEN category = 'X' THEN val END)`
Unpivot (wide to long)	`CROSS JOIN LATERAL VALUES` or `UNPIVOT`
Hierarchy traversal	`WITH RECURSIVE` CTE
Gap detection	`generate_series` left join + `WHERE NULL`
Deduplication	`ROW_NUMBER() = 1` or `DISTINCT ON` (PostgreSQL)
Division safety	`NULLIF(denominator, 0)`
Conditional count	`COUNT(*) FILTER (WHERE ...)` or `SUM(CASE WHEN ...)`
Point-in-time (SCD2)	`valid_from <= target_date AND valid_to > target_date`
YoY comparison	Self-join on same CTE aliased by year
Cohort retention	`MIN()` per user + join back to activity

What Interviewers Actually Look For

Beyond the correct answer, these are the signals a good interviewer reads:

Do you explain your approach before writing? Say "I'll use a CTE to rank within each partition, then filter" before typing. Interviewers want to hear how you think.

Do you know the difference between RANK and DENSE_RANK? Any window function problem will get a follow-up on this. Have a one-sentence answer ready.

Do you consider edge cases? Division by zero with NULLIF. NULLs in COUNT. Partial windows in moving averages. Mentioning these unprompted signals production experience.

Can you write it two ways? The subquery approach and the window function approach for the same problem. Knowing that the window function version is usually cleaner signals depth.

Do you know which patterns are database-specific? DISTINCT ON and generate_series are PostgreSQL only. PIVOT and UNPIVOT work in Snowflake and BigQuery but not standard PostgreSQL. QUALIFY for post-window filtering works in Snowflake and BigQuery but not PostgreSQL. Know what is portable and what is not.

These problems are drawn from real SQL patterns I've used building NSE stock pipelines, Kenyan property analytics, job market intelligence, and financial ledger reconciliation. All portfolio projects are on my GitHub.

Follow me on dev.to for more on data engineering, dbt, and Airflow.

DEV Community