Shopify data engineering interview questions are SQL end-to-end with a sharp merchant-sessions analytics edge: five SQL primitives (DATE_TRUNC('month', session_ts) + COUNT(*) for monthly session rollups, DATE_TRUNC('day', session_ts) + COUNT(*) for daily session counts, JOIN of shops to MIN(session_ts) GROUP BY shop_id with date-difference for activation latency, AVG(sessions) OVER (PARTITION BY shop_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) for trend-smoothed rolling traffic, and SUBSTRING(url FROM 'utm_source=([^&]+)') regex for marketing-attribution source extraction). The framings are merchant data—shop sessions, daily and monthly traffic, activation funnels, smoothed traffic trends, and UTM-tagged landing-URL analytics.
This guide walks through the five topic clusters Shopify actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 5-problem Shopify set (3 easy, 1 medium, 1 hard)—a SQL-only hub that rewards DATE_TRUNC discipline, the rolling-window frame clause, and regex fluency. There is no Python in this loop; every minute of preparation should go to SQL.
Top Shopify data engineering interview topics
From the Shopify data engineering practice set, the five numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–5) | Why it shows up at Shopify |
|---|---|---|
| 1 | SQL aggregation and DATE_TRUNC for monthly session counts | Monthly Session Count—DATE_TRUNC('month', session_ts) plus COUNT(*) plus ORDER BY month. |
| 2 | SQL DATE_TRUNC and COUNT for daily session counts | Daily Session Count—same shape with DATE_TRUNC('day', session_ts); per-shop variant adds GROUP BY shop_id, day. |
| 3 | SQL JOINs and date arithmetic for days-to-first-session per shop | Days to First Session Per Shop—MIN(session_ts) GROUP BY shop_id joined to shops.created_at with a date-difference projection. |
| 4 | SQL window functions for 7-day rolling average sessions per shop | 7-Day Rolling Average Sessions Per Shop (HARD)—AVG(sessions) OVER (PARTITION BY shop_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW). |
| 5 | SQL regex for UTM source extraction from landing URL | Extract UTM Source From Landing URL (MEDIUM)—SUBSTRING(url FROM 'utm_source=([^&]+)') with NULL fallback. |
Merchant-analytics framing rule: Shopify's prompts span pure merchant SQL—monthly traffic, daily traffic per shop, activation latency, rolling-trend smoothing, UTM attribution. The interviewer is grading whether you map each business framing to the right SQL primitive: monthly count → DATE_TRUNC('month', …) + COUNT; daily count → DATE_TRUNC('day', …) + COUNT; days-to-first → MIN(date) GROUP BY entity, JOIN to the baseline, then a date difference; rolling average → window with ROWS BETWEEN 6 PRECEDING AND CURRENT ROW; UTM extraction → regex capture group. State the mapping out loud.
1. SQL Aggregation and DATE_TRUNC for Monthly Session Counts
Date truncation and aggregation for monthly traffic in SQL for data engineering
"Build a monthly session count report" is the canonical merchant-analytics interview prompt. The mental model: DATE_TRUNC('month', session_ts) collapses every session in a calendar month to that month's first day; COUNT(*) GROUP BY <truncated date> rolls sessions up to one row per month. The same shape powers daily / weekly / quarterly rollups—swap the truncation argument and the grouping key follows.
Pro tip: Always cast or pin the truncated value to DATE (not TIMESTAMP) before grouping; mixing time zones across a TIMESTAMP WITH TIME ZONE column and a DATE_TRUNC call will silently bucket rows wrong at month boundaries. DATE_TRUNC('month', session_ts)::date is the safe form on PostgreSQL.
DATE_TRUNC('month', session_ts) collapses to month-start
DATE_TRUNC rounds a timestamp down to the nearest unit. The invariant: for any session_ts in March 2026, DATE_TRUNC('month', session_ts) returns 2026-03-01. Every March session maps to the same bucket key.
- DATE_TRUNC('month', t) — first day of the month containing t.
- DATE_TRUNC('day', t) — strips time-of-day; useful when session_ts is a TIMESTAMP.
- DATE_TRUNC('quarter', t) — first day of the quarter; pairs cleanly with quarterly traffic reports.
- DATE_TRUNC('year', t) — first day of the year; for annual rollups.
Worked example. Five sessions across two months. After DATE_TRUNC('month', session_ts):
| session_id | session_ts | month_bucket |
|---|---|---|
| 1 | 2026-03-04 09:12:00 | 2026-03-01 |
| 2 | 2026-03-19 14:27:00 | 2026-03-01 |
| 3 | 2026-03-31 23:55:00 | 2026-03-01 |
| 4 | 2026-04-02 06:01:00 | 2026-04-01 |
| 5 | 2026-04-25 18:30:00 | 2026-04-01 |
Worked-example solution.
SELECT
session_id,
session_ts,
DATE_TRUNC('month', session_ts)::date AS month_bucket
FROM sessions
ORDER BY session_ts;
Rule of thumb: whenever the prompt says "by month / by week / by quarter," reach for DATE_TRUNC before you reach for EXTRACT(MONTH FROM ...)—DATE_TRUNC keeps the year, so March 2025 and March 2026 stay distinct.
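To make the difference concrete, here is a sketch (PostgreSQL syntax; the sessions table and session_ts column follow the examples above) running both forms against data that spans two years:

```sql
-- EXTRACT collapses March 2025 and March 2026 into one bucket (month_no = 3):
SELECT EXTRACT(MONTH FROM session_ts) AS month_no, COUNT(*) AS sessions
FROM sessions
GROUP BY EXTRACT(MONTH FROM session_ts);

-- DATE_TRUNC keeps the year, so the two Marches stay distinct rows:
SELECT DATE_TRUNC('month', session_ts)::date AS month, COUNT(*) AS sessions
FROM sessions
GROUP BY DATE_TRUNC('month', session_ts)
ORDER BY month;
```

The first query is occasionally what you want (seasonality across years); for a time-series report it silently merges years, which is the trap the rule of thumb guards against.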
COUNT(*) over the month bucket
The aggregate invariant: COUNT(*) GROUP BY DATE_TRUNC('month', session_ts) produces one row per month with the total session count. Every monthly aggregate report follows this shape; only the truncation unit and aggregate function change.
- COUNT(*) — counts every row in the bucket.
- COUNT(DISTINCT visitor_id) — distinct visitors instead of total sessions.
- SUM(amount) — totals amount instead of counting rows.
- AVG(duration_s) — averages a duration metric.
Worked example. Continuing the previous five sessions, count by month.
| month_bucket | rows | COUNT(*) |
|---|---|---|
| 2026-03-01 | 3 | 3 |
| 2026-04-01 | 2 | 2 |
Worked-example solution.
SELECT
DATE_TRUNC('month', session_ts)::date AS month,
COUNT(*) AS sessions
FROM sessions
GROUP BY DATE_TRUNC('month', session_ts)
ORDER BY month;
Rule of thumb: make the GROUP BY expression match the SELECT expression exactly; engines disagree on whether GROUP BY 1 (ordinal position) or grouping by the alias is allowed, so repeating the expression is the safest portable form.
ORDER BY month for time-sorted output
The output invariant: GROUP BY does not guarantee output order. For a time-series report, append ORDER BY month to surface chronological output. Without it, downstream charting tools see arbitrary order.
- ORDER BY month — ascending chronological.
- ORDER BY month DESC — most recent first; common for ops dashboards.
- ORDER BY month + densification — for "include zero-traffic months," left-join against a generated calendar table.
- No ORDER BY after GROUP BY — flaky output across engines.
Worked example. Add a January 2026 session. Output should now span Jan, Mar, Apr—but Feb is correctly absent (no sessions).
| month | sessions |
|---|---|
| 2026-01-01 | 1 |
| 2026-03-01 | 3 |
| 2026-04-01 | 2 |
Worked-example solution.
SELECT
DATE_TRUNC('month', session_ts)::date AS month,
COUNT(*) AS sessions
FROM sessions
GROUP BY DATE_TRUNC('month', session_ts)
ORDER BY month ASC;
Rule of thumb: a time-series report without ORDER BY is a bug waiting to happen; always sort the output explicitly.
Common beginner mistakes
- Using EXTRACT(MONTH FROM session_ts) — collapses every March across years into one bucket; you lose the year.
- Forgetting the ::date cast — on TIMESTAMP WITH TIME ZONE columns, the result drifts at month boundaries.
- Aliasing the truncated column as month and then writing GROUP BY month (engine-dependent — Postgres allows it, others don't); safer to repeat the expression.
- Skipping ORDER BY — non-deterministic chronological output.
- Using COUNT(visitor_id) and being surprised by NULLs — use COUNT(*) for total rows or COUNT(DISTINCT visitor_id) for unique visitors.
SQL Interview Question on Monthly Session Count
Table sessions(session_id INT, session_ts TIMESTAMP). Return one row per month with the total count of sessions. Columns: month (first day of the month), sessions. Sort ascending by month.
Solution Using DATE_TRUNC and COUNT
SELECT
DATE_TRUNC('month', session_ts)::date AS month,
COUNT(*) AS sessions
FROM sessions
GROUP BY DATE_TRUNC('month', session_ts)
ORDER BY month ASC;
Step-by-step trace (input: 7 sessions across 3 months):
| session_id | session_ts |
|---|---|
| 1 | 2026-01-15 09:00:00 |
| 2 | 2026-03-04 11:00:00 |
| 3 | 2026-03-19 12:00:00 |
| 4 | 2026-03-31 23:50:00 |
| 5 | 2026-04-02 06:00:00 |
| 6 | 2026-04-25 18:00:00 |
| 7 | 2026-04-30 22:00:00 |
- Scan sessions — read all 7 rows.
- DATE_TRUNC('month', session_ts) — January row → 2026-01-01; March rows → 2026-03-01; April rows → 2026-04-01.
- GROUP BY DATE_TRUNC('month', session_ts) — three groups: Jan (1 row), Mar (3 rows), Apr (3 rows).
- COUNT(*) per group — Jan = 1; Mar = 3; Apr = 3.
- Project + ORDER BY month ASC — sort the three rows ascending.
Output:
| month | sessions |
|---|---|
| 2026-01-01 | 1 |
| 2026-03-01 | 3 |
| 2026-04-01 | 3 |
Why this works — concept by concept:
- DATE_TRUNC('month', session_ts) — collapses every session in a calendar month to the same first-of-month key, giving a stable group key for the aggregate.
- ::date cast — strips the time / timezone component so the bucket key is a clean DATE value, avoiding boundary drift.
- COUNT(*) — counts every row in the month bucket; the canonical aggregate for traffic rollups.
- GROUP BY matches the projected expression — safest across engines; some accept GROUP BY month (the alias) but not all.
- Sparse-by-default — months with zero sessions are absent from the output; correct unless the prompt asks for zero-row densification via a calendar table.
- ORDER BY month ASC — gives the consumer a chronological time-series; no client-side sort needed.
- Cost — one scan of sessions, one hash aggregate → O(N) time, O(M) space where M is the number of distinct months.
2. SQL DATE_TRUNC and COUNT for Daily Session Counts
Daily session bucketing and per-shop traffic in SQL for data engineering
"Daily session count" is the same DATE_TRUNC + COUNT shape as monthly, with 'day' as the truncation unit. The mental model: DATE_TRUNC('day', session_ts) collapses to the date; COUNT(*) GROUP BY day is one row per day. The two extensions Shopify reaches for: per-shop daily counts (GROUP BY shop_id, day) and densification with a calendar series (LEFT JOIN generate_series(...) to fill zero-traffic days).
Pro tip: Daily reports on small samples can give misleading conclusions; merchant-analytics teams typically pair the daily count with a 7-day rolling average (covered in §4) to separate signal from day-of-week noise. State this trade-off out loud—interviewers reward awareness of the analytical context.
Daily bucketing: DATE_TRUNC('day', session_ts)
The daily-bucket invariant: DATE_TRUNC('day', session_ts) returns the date with time stripped. Every session in a calendar day maps to the same key. On a TIMESTAMP column the result is a TIMESTAMP at midnight; cast to DATE for cleaner output.
- DATE_TRUNC('day', t) — strips time; the result is the date at 00:00.
- session_ts::date — Postgres shorthand for the same thing.
- DATE(session_ts) — engine-specific synonym (MySQL, Snowflake).
- Group key — buckets all intra-day sessions into the day's date.
Worked example. Three sessions on April 12 collapse to one bucket.
| session_ts | day_bucket |
|---|---|
| 2026-04-12 09:00:00 | 2026-04-12 |
| 2026-04-12 14:30:00 | 2026-04-12 |
| 2026-04-12 23:55:00 | 2026-04-12 |
| 2026-04-13 06:10:00 | 2026-04-13 |
Worked-example solution.
SELECT
session_ts,
DATE_TRUNC('day', session_ts)::date AS day_bucket
FROM sessions
ORDER BY session_ts;
Rule of thumb: on a pure DATE column, DATE_TRUNC is a no-op; use it when the source is a TIMESTAMP to be explicit about the truncation.
Densification with calendar generation
The densification invariant: GROUP BY day is sparse by default; days with zero sessions are absent from the output. To produce a contiguous day-by-day series, generate a calendar table and LEFT JOIN the sparse aggregate against it, replacing nulls with zero.
- generate_series(start_date, end_date, '1 day') — Postgres calendar-table generator.
- LEFT JOIN ON cal.day = agg.day — keeps every calendar row, attaching the session count where present.
- COALESCE(agg.sessions, 0) — replaces nulls with zero so charts don't break.
- Window-function alternative — for some prompts, a windowed OVER (ORDER BY day) over the sparse rows is enough.
Worked example. Three days of session data with a gap on April 13.
| day | sessions (sparse) |
|---|---|
| 2026-04-12 | 3 |
| 2026-04-14 | 5 |
After densification with a calendar:
| day | sessions |
|---|---|
| 2026-04-12 | 3 |
| 2026-04-13 | 0 |
| 2026-04-14 | 5 |
Worked-example solution.
WITH cal AS (
SELECT generate_series('2026-04-12'::date, '2026-04-14'::date, '1 day')::date AS day
),
agg AS (
SELECT DATE_TRUNC('day', session_ts)::date AS day, COUNT(*) AS sessions
FROM sessions
GROUP BY DATE_TRUNC('day', session_ts)
)
SELECT cal.day, COALESCE(agg.sessions, 0) AS sessions
FROM cal
LEFT JOIN agg ON agg.day = cal.day
ORDER BY cal.day;
Rule of thumb: densify when downstream charts need contiguous days; leave sparse when the consumer just wants "days with traffic."
Per-shop daily counts: GROUP BY shop_id, day
The per-shop invariant: add shop_id to the GROUP BY to produce one row per (shop, day). The output is a 2D matrix in long form—cell (shop, day) = sessions. This is the canonical input shape for §4's rolling-window query.
- GROUP BY shop_id, day — composite key; one row per cell.
- Output shape — long form (one row per cell), not wide.
- Densification — cross-join calendar × shops, then left-join the aggregate.
- Tie-break order — ORDER BY shop_id, day for deterministic output.
Worked example. Two shops × three days of sessions.
| shop_id | day | sessions |
|---|---|---|
| 1 | 2026-04-12 | 3 |
| 1 | 2026-04-13 | 5 |
| 2 | 2026-04-12 | 1 |
| 2 | 2026-04-14 | 2 |
Worked-example solution.
SELECT
shop_id,
DATE_TRUNC('day', session_ts)::date AS day,
COUNT(*) AS sessions
FROM sessions
GROUP BY shop_id, DATE_TRUNC('day', session_ts)
ORDER BY shop_id, day;
Rule of thumb: the long-form (shop, day, sessions) triple is the right input format for any window-function rollup over per-shop time series.
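The per-shop densification mentioned above (cross-join calendar × shops, then left-join the aggregate) can be sketched as follows. This is a PostgreSQL sketch; the shops table and the date bounds are assumptions for illustration:

```sql
-- Every (shop, day) cell appears, with 0 for zero-traffic days.
WITH cal AS (
  SELECT generate_series('2026-04-12'::date, '2026-04-14'::date, '1 day')::date AS day
),
agg AS (
  SELECT shop_id, DATE_TRUNC('day', session_ts)::date AS day, COUNT(*) AS sessions
  FROM sessions
  GROUP BY shop_id, DATE_TRUNC('day', session_ts)
)
SELECT s.id AS shop_id, cal.day, COALESCE(agg.sessions, 0) AS sessions
FROM shops s
CROSS JOIN cal                                        -- one row per (shop, day) cell
LEFT JOIN agg ON agg.shop_id = s.id AND agg.day = cal.day
ORDER BY s.id, cal.day;
```

The CROSS JOIN builds the full matrix first; the LEFT JOIN then attaches counts only where sessions exist.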
Common beginner mistakes
- Using session_ts directly in GROUP BY (without truncation) — every session goes to its own group, no aggregation.
- Counting COUNT(session_id) instead of COUNT(*) — usually identical, but COUNT(session_id) skips NULL ids.
- Forgetting to densify and surprising the consumer with a chart full of gaps.
- GROUP BY 1, 2 — works on Postgres but breaks on strict-mode engines; repeat the expressions to be safe.
- Mixing DATE_TRUNC('day', …) and a separate EXTRACT(DAY FROM …) — different output types; pick one.
SQL Interview Question on Daily Session Count Per Shop
Table sessions(session_id INT, shop_id INT, session_ts TIMESTAMP). Return one row per (shop_id, day) with the count of sessions. Columns: shop_id, day, sessions. Sort ascending by shop_id, then by day.
Solution Using DATE_TRUNC('day', ...) and a composite GROUP BY
SELECT
shop_id,
DATE_TRUNC('day', session_ts)::date AS day,
COUNT(*) AS sessions
FROM sessions
GROUP BY shop_id, DATE_TRUNC('day', session_ts)
ORDER BY shop_id ASC, day ASC;
Step-by-step trace (input: 6 rows across 2 shops × 3 days):
| session_id | shop_id | session_ts |
|---|---|---|
| 1 | 1 | 2026-04-12 09:00:00 |
| 2 | 1 | 2026-04-12 14:00:00 |
| 3 | 1 | 2026-04-13 11:00:00 |
| 4 | 2 | 2026-04-12 16:00:00 |
| 5 | 2 | 2026-04-14 08:30:00 |
| 6 | 2 | 2026-04-14 22:00:00 |
- Scan sessions — read all 6 rows.
- Truncate — 2026-04-12, 2026-04-13, 2026-04-14 per row.
- GROUP BY shop_id, day — four groups: (1, 2026-04-12), (1, 2026-04-13), (2, 2026-04-12), (2, 2026-04-14).
- COUNT(*) — (1, 2026-04-12) = 2, (1, 2026-04-13) = 1, (2, 2026-04-12) = 1, (2, 2026-04-14) = 2.
- ORDER BY shop_id, day — numeric shop order, then chronological days.
Output:
| shop_id | day | sessions |
|---|---|---|
| 1 | 2026-04-12 | 2 |
| 1 | 2026-04-13 | 1 |
| 2 | 2026-04-12 | 1 |
| 2 | 2026-04-14 | 2 |
Why this works — concept by concept:
- DATE_TRUNC('day', session_ts) — collapses every intra-day timestamp to the calendar date; the canonical day key.
- Composite GROUP BY (shop_id, day) — produces one row per (shop, day) cell, exactly what merchant-analytics dashboards consume.
- COUNT(*) — total session count per cell; the simplest possible traffic metric.
- Long-form output — one row per cell rather than a wide pivot keeps the result compatible with downstream tools that prefer long format.
- ORDER BY shop_id, day — deterministic output; charting libraries can render directly without a sort step.
- Sparse-by-default — days with no sessions for a shop are absent; densify with a calendar cross-join only if the consumer needs zero rows.
- Cost — one scan of sessions, one hash aggregate over the composite key → O(N) time, O(S × D) space where S is shops and D is distinct days.
3. SQL JOINs and Date Arithmetic for Days-to-First-Session Per Shop
Activation latency via JOIN and date difference in SQL for data engineering
"For each shop, how many days from created_at until the first session?" is the canonical activation-latency interview prompt. The mental model: MIN(session_ts) GROUP BY shop_id gives the first-touch per shop; JOIN to shops.created_at to anchor the baseline; subtract dates to produce the latency in days. This is the bread-and-butter pattern for any "time-to-first-X" question—time-to-first-purchase, time-to-first-payout, time-to-first-active-user.
Pro tip: Use a LEFT JOIN from shops to the first-session CTE—not the other way round. Some shops never had a session; you want them in the output with NULL (or "never activated") rather than dropped silently. The interviewer will probe the missing-side case if you skip it.
MIN(session_ts) GROUP BY shop_id for first-touch
The first-touch invariant: MIN(session_ts) GROUP BY shop_id collapses each shop to one row whose timestamp is the shop's first session. Repeated sessions by the same shop contribute zero to "first" because we keep only the minimum.
- MIN(session_ts) — earliest session per shop.
- GROUP BY shop_id — one row per shop.
- Optional bound — WHERE session_ts >= '2026-04-01' if the report is windowed.
- Cast — MIN(session_ts)::date if you want the date only (no time component).
Worked example. Five sessions across three shops.
| session_id | shop_id | session_ts |
|---|---|---|
| 1 | 1 | 2026-04-12 09:00:00 |
| 2 | 1 | 2026-04-15 12:00:00 |
| 3 | 2 | 2026-04-13 14:00:00 |
| 4 | 3 | 2026-04-20 16:00:00 |
| 5 | 3 | 2026-04-22 18:00:00 |
After MIN(session_ts) GROUP BY shop_id:
| shop_id | first_session_ts |
|---|---|
| 1 | 2026-04-12 09:00:00 |
| 2 | 2026-04-13 14:00:00 |
| 3 | 2026-04-20 16:00:00 |
Worked-example solution.
SELECT
shop_id,
MIN(session_ts) AS first_session_ts
FROM sessions
GROUP BY shop_id;
Rule of thumb: "first time a thing happened per entity" → MIN(timestamp) GROUP BY entity_id.
JOIN to shops.created_at to anchor the baseline
The JOIN invariant: the activation baseline lives on the shops table; the first session lives on sessions. JOIN them on shop_id to bring both timestamps into one row. Use LEFT JOIN from shops so shops that never had a session still appear in the output.
- LEFT JOIN first_touch ON first_touch.shop_id = shops.id — keeps every shop.
- shops.created_at — the activation baseline.
- first_touch.first_session_ts — the first-session timestamp (NULL if never activated).
- first_session_ts::date - shops.created_at::date — latency in days; the subtraction propagates NULL, so never-activated shops stay NULL without a special case.
Worked example. Three shops with created_at; only the first two had a session.
| shop_id | created_at | first_session_ts |
|---|---|---|
| 1 | 2026-04-10 | 2026-04-12 09:00:00 |
| 2 | 2026-04-11 | 2026-04-13 14:00:00 |
| 3 | 2026-04-15 | NULL |
Worked-example solution.
WITH first_touch AS (
SELECT shop_id, MIN(session_ts) AS first_session_ts
FROM sessions
GROUP BY shop_id
)
SELECT
s.id AS shop_id,
s.created_at,
ft.first_session_ts
FROM shops s
LEFT JOIN first_touch ft ON ft.shop_id = s.id
ORDER BY s.id;
Rule of thumb: the JOIN brings the baseline onto the metric row; in activation funnels, the baseline is always the entity's creation time.
Date difference: MIN(session_ts)::date - shops.created_at::date
The difference invariant: (date_a - date_b) returns an integer count of days (in PostgreSQL; engine-specific elsewhere—DATEDIFF on Snowflake/MySQL). Cast both to DATE to get day granularity; leave as TIMESTAMP for sub-day precision.
- (d1 - d2) — Postgres day difference between DATE values (returns an integer).
- DATEDIFF('day', d2, d1) — Snowflake / Databricks.
- DATEDIFF(d1, d2) — MySQL.
- EXTRACT(EPOCH FROM (t1 - t2)) / 86400.0 — Postgres for fractional days.
Worked example. Continuing the previous CTE.
| shop_id | created_at | first_session_ts | days_to_first_session |
|---|---|---|---|
| 1 | 2026-04-10 | 2026-04-12 | 2 |
| 2 | 2026-04-11 | 2026-04-13 | 2 |
| 3 | 2026-04-15 | NULL | NULL |
Worked-example solution.
WITH first_touch AS (
SELECT shop_id, MIN(session_ts)::date AS first_day
FROM sessions
GROUP BY shop_id
)
SELECT
s.id AS shop_id,
s.created_at::date AS created_day,
ft.first_day,
(ft.first_day - s.created_at::date) AS days_to_first_session
FROM shops s
LEFT JOIN first_touch ft ON ft.shop_id = s.id
ORDER BY s.id;
Rule of thumb: always cast both sides to DATE before subtracting; TIMESTAMP - TIMESTAMP returns an INTERVAL, not an integer.
Common beginner mistakes
- Using INNER JOIN and dropping never-activated shops — use LEFT JOIN to keep them with NULL.
- Forgetting to cast both sides to DATE — TIMESTAMP - TIMESTAMP returns an INTERVAL.
- Computing days = (first - created) / 86400.0 from epoch seconds — correct but verbose; use the date-subtraction form.
- GROUP BY shop_id, created_at without a CTE — mixes aggregation and per-row selection in confusing ways.
- Treating MIN(session_ts) < created_at as impossible — data-quality issues happen; consider whether to filter or surface them.
SQL Interview Question on Days to First Session Per Shop
Tables shops(id INT, created_at TIMESTAMP) and sessions(session_id INT, shop_id INT, session_ts TIMESTAMP). Return one row per shop with shop_id, created_at (date), and days_to_first_session (integer) — the day count between created_at and the shop's first session. For shops with no session, return NULL for days_to_first_session. Sort ascending by shop_id.
Solution Using a first-touch CTE and a LEFT JOIN with date subtraction
WITH first_touch AS (
SELECT shop_id, MIN(session_ts)::date AS first_day
FROM sessions
GROUP BY shop_id
)
SELECT
s.id AS shop_id,
s.created_at::date AS created_at,
(ft.first_day - s.created_at::date) AS days_to_first_session
FROM shops s
LEFT JOIN first_touch ft ON ft.shop_id = s.id
ORDER BY s.id ASC;
Step-by-step trace (input: 4 shops, 5 sessions; shop 4 never activated):
| shops: id | created_at |
|---|---|
| 1 | 2026-04-10 08:00:00 |
| 2 | 2026-04-11 09:00:00 |
| 3 | 2026-04-12 10:00:00 |
| 4 | 2026-04-15 11:00:00 |
| sessions: session_id | shop_id | session_ts |
|---|---|---|
| 1 | 1 | 2026-04-12 14:00:00 |
| 2 | 1 | 2026-04-15 09:00:00 |
| 3 | 2 | 2026-04-13 16:00:00 |
| 4 | 3 | 2026-04-12 11:00:00 |
| 5 | 3 | 2026-04-14 12:00:00 |
- Build first_touch — MIN(session_ts)::date GROUP BY shop_id:
  - Shop 1 → 2026-04-12
  - Shop 2 → 2026-04-13
  - Shop 3 → 2026-04-12
  - Shop 4 → no row (never activated).
- LEFT JOIN — shops (4 rows) joined to first_touch (3 rows). Shop 4's first_day is NULL.
- Date subtraction —
  - Shop 1: 2026-04-12 - 2026-04-10 = 2
  - Shop 2: 2026-04-13 - 2026-04-11 = 2
  - Shop 3: 2026-04-12 - 2026-04-12 = 0
  - Shop 4: NULL - 2026-04-15 = NULL
- ORDER BY shop_id ASC — final ascending sort.
Output:
| shop_id | created_at | days_to_first_session |
|---|---|---|
| 1 | 2026-04-10 | 2 |
| 2 | 2026-04-11 | 2 |
| 3 | 2026-04-12 | 0 |
| 4 | 2026-04-15 | NULL |
Why this works — concept by concept:
- MIN(session_ts) GROUP BY shop_id — collapses each shop into one row whose date is its first interaction; the canonical "first" definition.
- ::date cast on MIN — strips the time component so the subtraction returns an integer day count, not an interval.
- LEFT JOIN from shops — keeps every shop in the output; shops that never appeared in sessions show NULL for first_day and therefore NULL for the latency, which is exactly the desired semantics.
- Date subtraction (first_day - created_at::date) — Postgres returns an integer day count; use the engine-specific DATEDIFF on other dialects.
- NULL-propagating arithmetic — NULL - date = NULL automatically, so never-activated shops don't need a special-case branch.
- ORDER BY shop_id ASC — deterministic output regardless of physical row order.
- Cost — one scan of sessions for the CTE, one hash join plus one scan of shops, one scalar projection → O(N + M) time where N = sessions and M = shops.
4. SQL Window Functions for 7-Day Rolling Average Sessions Per Shop
Frame-clause window functions for trailing rolling averages in SQL for data engineering
"For each shop, compute a 7-day trailing rolling average of daily sessions" is the Hard-tier interview prompt for merchant analytics. The mental model: AVG(sessions) OVER (PARTITION BY shop_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) averages the current day's sessions plus the prior six days into a single value—repeated for every (shop, day) cell. The frame clause ROWS BETWEEN 6 PRECEDING AND CURRENT ROW is the entire trick; without it, the default frame includes everything from partition start to current row, which is a cumulative average, not a 7-day rolling.
Pro tip: The default window frame for an ORDER BY window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW—a cumulative window, not a rolling one. Always state the frame explicitly: ROWS BETWEEN N PRECEDING AND CURRENT ROW for trailing N+1; ROWS BETWEEN N PRECEDING AND N FOLLOWING for centered. Default frames bite candidates who think OVER (ORDER BY day) alone gives a rolling average.
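A minimal sketch of the trap, assuming a daily_sessions(shop_id, day, sessions) table with one row per (shop, day). Computing both frames side by side makes the divergence visible once a partition has more than seven rows:

```sql
SELECT
  shop_id, day, sessions,
  -- Default frame: cumulative average from partition start to current row.
  AVG(sessions) OVER (PARTITION BY shop_id ORDER BY day) AS cumulative_avg,
  -- Explicit frame: trailing 7-row rolling average.
  AVG(sessions) OVER (
    PARTITION BY shop_id
    ORDER BY day
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS trailing_7_avg
FROM daily_sessions
ORDER BY shop_id, day;
```

The two columns agree for the first seven days of each shop, then drift apart as the cumulative frame keeps growing while the rolling frame stays at seven rows.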
PARTITION BY shop_id ORDER BY day for per-shop time windows
The window-partition invariant: PARTITION BY shop_id slices the row set into independent per-shop windows; ORDER BY day defines what "previous" means within each partition. The rolling average never crosses shops because the partition isolates each.
- PARTITION BY shop_id — independent per-shop frames.
- ORDER BY day — chronological order within the partition.
- No PARTITION BY — the window spans the whole table (one global rolling average).
- ORDER BY day ASC — explicit ascending; DESC would invert what "preceding" means.
Worked example. Two shops × 7 days each. Partition splits the 14 rows into two independent windows.
| shop_id | day | sessions |
|---|---|---|
| 1 | 2026-04-08 | 50 |
| 1 | 2026-04-09 | 60 |
| 1 | 2026-04-10 | 55 |
| ... | ... | ... |
| 2 | 2026-04-08 | 30 |
| 2 | 2026-04-09 | 40 |
| ... | ... | ... |
Worked-example solution.
SELECT
shop_id, day, sessions,
COUNT(*) OVER (PARTITION BY shop_id ORDER BY day) AS rownum_so_far
FROM daily_sessions
ORDER BY shop_id, day;
Rule of thumb: if the prompt says "per shop" or "per user" with a time series, PARTITION BY entity ORDER BY time is the correct skeleton.
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW frame clause
The frame invariant: ROWS BETWEEN N PRECEDING AND CURRENT ROW includes the current row and the N rows physically before it in the partition's order. For a 7-day trailing window, N = 6 (current + 6 preceding = 7 total).
- ROWS BETWEEN 6 PRECEDING AND CURRENT ROW — the last 7 rows.
- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW — cumulative.
- ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING — a 7-row centered window.
- RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW — 7 calendar days regardless of gaps.
Worked example. 10 daily session counts for shop 1; show the trailing-7 frame at three different days.
| day | sessions | frame at this day | window contents |
|---|---|---|---|
| 04-08 | 50 | 1 row | {50} |
| 04-09 | 60 | 2 rows | {50, 60} |
| 04-14 | 75 | 7 rows | {50, 60, 55, 70, 65, 80, 75} |
Worked-example solution.
SELECT
shop_id, day, sessions,
COUNT(*) OVER (
PARTITION BY shop_id
ORDER BY day
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS frame_size
FROM daily_sessions
ORDER BY shop_id, day;
-- frame_size grows from 1 (day 1) up to 7 (day 7+) and stays at 7 thereafter.
Rule of thumb: ROWS counts physical rows; RANGE counts logical values (good for date-aware windows on irregular series). Pick ROWS for "last 7 daily rows," RANGE for "last 7 calendar days regardless of gaps."
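The RANGE form from the rule of thumb might look like this on the same assumed daily_sessions table (PostgreSQL 11+ supports RANGE with an INTERVAL offset):

```sql
-- Calendar-aware trailing window: "last 7 calendar days", not "last 7 rows".
SELECT
  shop_id, day, sessions,
  AVG(sessions) OVER (
    PARTITION BY shop_id
    ORDER BY day
    RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
  ) AS trailing_7_calendar_avg
FROM daily_sessions
ORDER BY shop_id, day;
```

On a gappy series the two forms differ: ROWS averages the last seven rows however far apart their dates are, while RANGE averages only the rows whose day falls inside the 7-calendar-day window (missing days simply contribute nothing).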
AVG(sessions) over the bounded frame for trailing 7-day average
The aggregate-over-frame invariant: placing AVG(sessions) in front of the OVER (...) clause computes the average over the frame for each row. Repeat for every (shop, day) cell; the engine slides the frame by one row at a time.
- AVG(sessions) OVER (...) — rolling average.
- SUM(sessions) OVER (...) — rolling sum.
- MAX(sessions) OVER (...) — rolling max.
- COUNT(*) OVER (...) — frame size; 1, 2, ..., 7, 7, 7 for a ROWS BETWEEN 6 PRECEDING AND CURRENT ROW frame.
Worked example. Shop 1, 10 daily session counts.
| day | sessions | trailing_7_avg |
|---|---|---|
| 04-08 | 50 | 50.00 (1 row) |
| 04-09 | 60 | 55.00 (2 rows) |
| 04-10 | 55 | 55.00 (3 rows) |
| 04-11 | 70 | 58.75 (4 rows) |
| 04-12 | 65 | 60.00 (5 rows) |
| 04-13 | 80 | 63.33 (6 rows) |
| 04-14 | 75 | 65.00 (7 rows) |
| 04-15 | 90 | 70.71 (last 7) |
Worked-example solution.
SELECT
shop_id,
day,
sessions,
ROUND(
AVG(sessions) OVER (
PARTITION BY shop_id
ORDER BY day
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
),
2
) AS trailing_7_avg
FROM daily_sessions
ORDER BY shop_id, day;
Rule of thumb: always ROUND(..., 2) the rolling average for readable output; trailing decimals look noisy.
Common beginner mistakes
- Omitting the frame clause and getting a cumulative average instead of a rolling one.
- Using RANGE BETWEEN 6 PRECEDING AND CURRENT ROW on an integer column — works, but on a date column it counts logical values, not rows; pick deliberately.
- Forgetting PARTITION BY shop_id — the average bleeds across shops.
- Computing the rolling average from raw sessions rows rather than from a pre-aggregated daily CTE — correct only if there is exactly one row per (shop, day).
- ORDER BY day DESC inside the window — inverts what "preceding" means; the rolling average becomes a leading-7 instead of trailing.
SQL Interview Question on 7-Day Rolling Average Sessions Per Shop
Table daily_sessions(shop_id INT, day DATE, sessions INT) with one row per (shop_id, day). Return shop_id, day, sessions, and trailing_7_avg — the rolling average of daily sessions over the current day plus the 6 preceding days, per shop. Round trailing_7_avg to 2 decimals. Sort ascending by shop_id, then by day.
Solution Using AVG(...) OVER (PARTITION BY shop_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
SELECT
shop_id,
day,
sessions,
ROUND(
AVG(sessions) OVER (
PARTITION BY shop_id
ORDER BY day
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
)::numeric,
2
) AS trailing_7_avg
FROM daily_sessions
ORDER BY shop_id ASC, day ASC;
Step-by-step trace (input: 1 shop × 8 days):
| shop_id | day | sessions |
|---|---|---|
| 1 | 2026-04-08 | 50 |
| 1 | 2026-04-09 | 60 |
| 1 | 2026-04-10 | 55 |
| 1 | 2026-04-11 | 70 |
| 1 | 2026-04-12 | 65 |
| 1 | 2026-04-13 | 80 |
| 1 | 2026-04-14 | 75 |
| 1 | 2026-04-15 | 90 |
- Partition — only one shop, so one partition with 8 rows.
-
Order — by
dayascending. -
Frame —
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW; the frame has 1 row on day 1, grows to 7 by day 7, stays at 7 for day 8. -
AVG(sessions)per row —- Day 1: AVG of {50} = 50.00.
- Day 2: AVG of {50, 60} = 55.00.
- Day 3: AVG of {50, 60, 55} = 55.00.
- Day 4: AVG of {50, 60, 55, 70} = 58.75.
- Day 5: AVG of {50, 60, 55, 70, 65} = 60.00.
- Day 6: AVG of {50, 60, 55, 70, 65, 80} = 63.33.
- Day 7: AVG of {50, 60, 55, 70, 65, 80, 75} = 65.00.
- Day 8: AVG of {60, 55, 70, 65, 80, 75, 90} = 70.71 (the day-1 value drops out as the window slides).
- `ROUND(..., 2)` — rounds to 2 decimals.
Output:
| shop_id | day | sessions | trailing_7_avg |
|---|---|---|---|
| 1 | 2026-04-08 | 50 | 50.00 |
| 1 | 2026-04-09 | 60 | 55.00 |
| 1 | 2026-04-10 | 55 | 55.00 |
| 1 | 2026-04-11 | 70 | 58.75 |
| 1 | 2026-04-12 | 65 | 60.00 |
| 1 | 2026-04-13 | 80 | 63.33 |
| 1 | 2026-04-14 | 75 | 65.00 |
| 1 | 2026-04-15 | 90 | 70.71 |
Why this works — concept by concept:
- `PARTITION BY shop_id` — slices the row set into per-shop windows so the rolling average is computed independently per shop and never bleeds across shops.
- `ORDER BY day` — defines the chronological order that "preceding" depends on; without it, "preceding" is undefined.
- `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` — the explicit frame clause; bounds the average to the current row plus the six rows before it for a 7-row trailing window.
- `AVG(sessions) OVER (...)` — the aggregate-over-frame; the engine slides the frame by one row at each cell and averages the frame's `sessions` values.
- Ramp-up at partition start — for the first six days per shop, the frame is shorter than 7 rows; the average uses whatever's available, which is the standard rolling-average convention (avoid silently emitting NULL just because there isn't a full 7-day history).
- `ROUND(..., 2)` — rounds the floating-point output to two decimals for readable charting.
- Cost — one scan of `daily_sessions` and one window pass with a frame buffer of size 7 → `O(N log N)` worst-case if the engine sorts inside the window; typically `O(N)` with a hash-partition operator.
5. SQL Regex for UTM Source Extraction From Landing URL
Regex extraction from query strings in SQL for data engineering
"Extract the utm_source value from a landing URL" is the canonical regex interview prompt for marketing-attribution work. The mental model: the landing URL has a query string ?utm_source=google&utm_medium=cpc&utm_campaign=spring; the regex pattern utm_source=([^&]+) matches the literal utm_source= then captures everything until the next & or end of string. The SUBSTRING(url FROM 'pattern') form returns the capture group; missing parameters return NULL cleanly.
Pro tip: Different engines spell this differently. Postgres: `SUBSTRING(url FROM 'utm_source=([^&]+)')`. Snowflake: `REGEXP_SUBSTR(url, 'utm_source=([^&]+)', 1, 1, 'e', 1)`. MySQL: `REGEXP_SUBSTR(url, 'utm_source=[^&]+')` then `SUBSTRING_INDEX`. Ask the interviewer which dialect to target before writing the query.
Regex pattern: utm_source=([^&]+)
The regex invariant: utm_source= is a literal anchor; ([^&]+) captures one or more characters that are not &. The capture group (...) is what gets extracted; the outer literal anchors the match to the right query parameter.
- `utm_source=` — literal; matches that exact substring.
- `(...)` capture group — the substring to extract.
- `[^&]+` — one or more characters that are not `&` (the query-string delimiter).
- `+` quantifier — one or more; use `*` for zero-or-more (matches `utm_source=` with no value).
Worked example. Three landing URLs.
| url | match | capture |
|---|---|---|
| /landing?utm_source=google&utm_medium=cpc | utm_source=google | google |
| /landing?ref=email&utm_source=fb | utm_source=fb | fb |
| /landing?utm_medium=cpc | (no match) | NULL |
Worked-example solution.
```sql
-- Pattern walkthrough
SELECT
  '/landing?utm_source=google&utm_medium=cpc' AS url,
  SUBSTRING('/landing?utm_source=google&utm_medium=cpc' FROM 'utm_source=([^&]+)') AS utm_source;
-- utm_source = 'google'
```
Rule of thumb: always anchor the literal (here utm_source=); a bare [^&]+ matches the very first param value, not the right one.
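The anchoring rule is quick to verify with Python's `re` module standing in for the SQL engine's regex (the character class and quantifier behave the same way):

```python
import re

url = "/landing?ref=email&utm_source=fb"

# Anchored: the literal pins the match to the utm_source parameter.
anchored = re.search(r"utm_source=([^&]+)", url)
print(anchored.group(1))  # fb

# Unanchored: a bare value pattern grabs the first parameter's value instead.
bare = re.search(r"=([^&]+)", url)
print(bare.group(1))  # email
```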
Capture group extraction with SUBSTRING(url FROM 'pattern')
The extraction invariant: `SUBSTRING(string FROM regex)` returns the matching capture group, or the entire match if no capture group is given. The Postgres form is the cleanest of the dialect variants.
- `SUBSTRING(url FROM 'utm_source=([^&]+)')` — Postgres; returns the capture group.
- `REGEXP_SUBSTR(url, 'utm_source=[^&]+')` — Snowflake / Oracle; returns the entire match (would include `utm_source=`).
- `REGEXP_REPLACE(url, '.*utm_source=([^&]+).*', '\1')` — alternative; replaces the whole string with the capture group.
- `split_part(split_part(url, 'utm_source=', 2), '&', 1)` — non-regex alternative; cleaner on engines without strong regex support.
Worked example. Apply SUBSTRING ... FROM to four URLs.
| url | extracted utm_source |
|---|---|
| /landing?utm_source=google&utm_medium=cpc | google |
| /landing?ref=email&utm_source=fb | fb |
| /landing?utm_medium=cpc | NULL |
| /landing | NULL |
Worked-example solution.
```sql
SELECT
  url,
  SUBSTRING(url FROM 'utm_source=([^&]+)') AS utm_source
FROM landings;
```
Rule of thumb: SUBSTRING(... FROM 'pattern') is the cleanest form; if you need cross-engine portability, fall back to split_part chained calls.
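The `split_part` chain maps directly onto plain string splitting. A sketch of the same two-step split in Python (the helper name is made up for illustration; note that Postgres `split_part` yields an empty string, not NULL, when the literal is absent):

```python
def utm_source_split(url: str):
    # Mirrors split_part(split_part(url, 'utm_source=', 2), '&', 1):
    # step 1 — everything after the literal 'utm_source=';
    # step 2 — everything before the next '&'.
    # We return None for the absent case to mimic the regex form's NULL.
    parts = url.split("utm_source=", 1)
    if len(parts) < 2:
        return None
    return parts[1].split("&", 1)[0]

print(utm_source_split("/landing?utm_source=google&utm_medium=cpc"))  # google
print(utm_source_split("/landing?utm_medium=cpc"))                    # None
```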
NULL fallback when the parameter is absent
The fallback invariant: regex extraction returns NULL when the pattern doesn't match; downstream consumers usually want a labeled fallback like 'unknown'. Wrap with COALESCE(extracted, 'unknown') to surface a non-NULL value when the parameter is missing.
- `COALESCE(extracted, 'unknown')` — labeled fallback.
- `COALESCE(extracted, 'direct')` — common attribution convention; "no UTM = direct traffic."
- `CASE WHEN extracted IS NULL THEN 'direct' ELSE extracted END` — equivalent but verbose.
- `NULLIF(extracted, '')` — defensive; treats the empty string as NULL too.
Worked example. Apply COALESCE to the previous result.
| url | utm_source | with_fallback |
|---|---|---|
| /landing?utm_source=google&utm_medium=cpc | google | google |
| /landing?ref=email&utm_source=fb | fb | fb |
| /landing?utm_medium=cpc | NULL | direct |
| /landing | NULL | direct |
Worked-example solution.
```sql
SELECT
  url,
  COALESCE(
    SUBSTRING(url FROM 'utm_source=([^&]+)'),
    'direct'
  ) AS utm_source
FROM landings;
```
Rule of thumb: the convention in marketing analytics is 'direct' for missing UTM, not NULL — surface it consistently so downstream attribution reports have a single category for "no source."
Common beginner mistakes
- Using `LIKE '%utm_source=%'` instead of regex—matches presence but doesn't extract the value.
- Forgetting the capture group `(...)` and getting back `utm_source=google` instead of `google`.
- Using `[^&]*` (zero-or-more) and matching an empty string when the parameter has no value—use `[^&]+` (one-or-more) unless you specifically want to match empty values.
- Skipping the COALESCE-to-`'direct'` step—downstream attribution reports show NULL as a separate category.
- Anchoring with `^utm_source=` and missing matches when `utm_source` isn't the first parameter.
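A few of these pitfalls are easy to reproduce with Python's `re` as a stand-in for the SQL regex engine:

```python
import re

empty_value = "/landing?utm_source=&utm_medium=cpc"

# [^&]+ needs at least one character, so an empty value is no match at all.
plus = re.search(r"utm_source=([^&]+)", empty_value)
print(plus)  # None

# [^&]* happily captures the empty string instead.
star = re.search(r"utm_source=([^&]*)", empty_value)
print(repr(star.group(1)))  # ''

# ^utm_source= misses URLs where utm_source isn't the first parameter.
late = "/landing?ref=email&utm_source=fb"
print(re.search(r"^utm_source=([^&]+)", late))  # None
```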
SQL Interview Question on UTM Source Extraction
Table landings(landing_id INT, url TEXT). Return one row per landing with landing_id, url, and utm_source — the extracted value of the utm_source query parameter. If the URL has no utm_source, return the literal string 'direct'. Sort ascending by landing_id.
Solution Using SUBSTRING(url FROM 'utm_source=([^&]+)') with COALESCE
```sql
SELECT
  landing_id,
  url,
  COALESCE(
    SUBSTRING(url FROM 'utm_source=([^&]+)'),
    'direct'
  ) AS utm_source
FROM landings
ORDER BY landing_id ASC;
```
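SQLite has no `SUBSTRING ... FROM` regex form, so here is a pure-Python re-run of the same extraction-plus-fallback logic over the four trace rows (`COALESCE` becomes a conditional fallback):

```python
import re

# The four trace rows from the problem input.
landings = [
    (1, "/landing?utm_source=google&utm_medium=cpc&utm_campaign=spring"),
    (2, "/landing?ref=email&utm_source=facebook&utm_medium=social"),
    (3, "/landing?utm_medium=cpc"),
    (4, "/landing"),
]

def utm_source(url: str) -> str:
    # SUBSTRING(url FROM 'utm_source=([^&]+)') -> capture group or None,
    # then COALESCE(..., 'direct') as the labeled fallback.
    m = re.search(r"utm_source=([^&]+)", url)
    return m.group(1) if m else "direct"

for landing_id, url in sorted(landings):  # ORDER BY landing_id ASC
    print(landing_id, utm_source(url))
# 1 google / 2 facebook / 3 direct / 4 direct
```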
Step-by-step trace (input: 4 landings):
| landing_id | url |
|---|---|
| 1 | /landing?utm_source=google&utm_medium=cpc&utm_campaign=spring |
| 2 | /landing?ref=email&utm_source=facebook&utm_medium=social |
| 3 | /landing?utm_medium=cpc |
| 4 | /landing |
- Row 1 — pattern `utm_source=([^&]+)` matches `utm_source=google`; the capture group is `google`. `COALESCE('google', 'direct')` returns `google`.
- Row 2 — pattern matches `utm_source=facebook`; the capture group is `facebook`. `COALESCE('facebook', 'direct')` returns `facebook`.
- Row 3 — pattern doesn't match (`utm_source` not present); `SUBSTRING ... FROM` returns NULL. `COALESCE(NULL, 'direct')` returns `'direct'`.
- Row 4 — pattern doesn't match; same as row 3, returns `'direct'`.
- `ORDER BY landing_id ASC` — final ascending order.
Output:
| landing_id | url | utm_source |
|---|---|---|
| 1 | /landing?utm_source=google&utm_medium=cpc&utm_campaign=spring | google |
| 2 | /landing?ref=email&utm_source=facebook&utm_medium=social | facebook |
| 3 | /landing?utm_medium=cpc | direct |
| 4 | /landing | direct |
Why this works — concept by concept:
- Literal anchor `utm_source=` — pins the regex to the right parameter; `[^&]+` alone would match any first-encountered query value.
- Capture group `([^&]+)` — selects only the value (everything after `utm_source=` up to the next `&` or end of string), excluding the anchor literal.
- `[^&]` character class — "any character except ampersand," the only delimiter that ends a query-parameter value.
- `+` quantifier — one or more; ignores empty `utm_source=` cases (use `*` if you want them).
- `SUBSTRING(... FROM 'pattern')` — Postgres returns the capture group when the pattern has one; cleaner than `REGEXP_SUBSTR` variants that return the entire match.
- `COALESCE(extracted, 'direct')` — labeled fallback for the no-UTM case; downstream attribution reports get a single bucket for missing-source instead of NULL leakage.
- `ORDER BY landing_id ASC` — deterministic output regardless of physical row order.
- Cost — one scan of `landings` and one regex evaluation per row → `O(N · L)` time where `L` is average URL length (regex matching is linear in input length).
Tips to Crack Shopify Data Engineering Interviews
SQL is the entire loop—skip Python prep
Every problem in the curated Shopify SQL practice set is SQL. There is no Python in this loop. Spending interview-prep time on Python algorithms is a misallocation; spend it on DATE_TRUNC, the rolling-window frame clause, regex extraction, and MIN(date) GROUP BY entity first-touch CTEs. Expect 3–5 SQL coding problems across the loop, all framed as merchant-session analytics.
Drill the rolling-window frame clause until automatic
The Hard-tier #555 7-Day Rolling Average problem hinges on one structural choice: writing ROWS BETWEEN 6 PRECEDING AND CURRENT ROW instead of relying on the default frame. The default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW—a cumulative average, not a rolling one. Drill the window-functions practice page and the moving-average practice page until the explicit frame clause is muscle memory.
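The contrast is easy to demonstrate in SQLite, which also defaults to a cumulative frame once `ORDER BY` appears in the window: on the last of 8 days, the default frame still averages all 8 rows, while the explicit frame averages only the trailing 7.

```python
import sqlite3

# Demonstrate default (cumulative) vs explicit (rolling) frames in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sessions (shop_id INT, day TEXT, sessions INT)")
conn.executemany(
    "INSERT INTO daily_sessions VALUES (1, ?, ?)",
    [(f"2026-04-{d:02d}", s)
     for d, s in zip(range(8, 16), [50, 60, 55, 70, 65, 80, 75, 90])],
)

row = conn.execute("""
    SELECT
        AVG(sessions) OVER (ORDER BY day) AS cumulative_avg,   -- default frame
        ROUND(AVG(sessions) OVER (
            ORDER BY day
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW           -- explicit frame
        ), 2) AS rolling_avg
    FROM daily_sessions
    ORDER BY day DESC
    LIMIT 1
""").fetchone()
print(row)  # (68.125, 70.71): cumulative keeps day 1, rolling drops it
```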
Merchant-analytics framing on every prompt
Frame your verbal walkthrough in merchant-analytics vocabulary: shops, sessions, days, months, activations, attribution, UTM sources. Interviewers test whether you can translate the abstract SQL primitive to the business question—matching the interviewer's mental model of what their team builds. The Shopify easy practice set leans on these framings; mirror that vocabulary in your answer.
Regex in SQL is a real, expected primitive
556 UTM Source Extraction is the medium-tier proof that regex is part of the SQL toolkit at Shopify. Memorize the Postgres form SUBSTRING(url FROM 'utm_source=([^&]+)') and the Snowflake form REGEXP_SUBSTR(...). Always pair the extraction with a COALESCE(..., 'direct') for the no-UTM case. Drill the regular-expressions practice page until you can write the pattern from memory.
Where to practice on PipeCode
Start with the Shopify practice page for the curated 5-problem set. After that, drill the matching topic pages: aggregation, date-functions, window-functions, moving-average, regular-expressions. The interview courses page bundles the SQL course if you want a structured curriculum before company drills. For broad coverage, browse by topic.
Communication and approach under time pressure
For every prompt, narrate the structure before writing code: "I'll truncate session_ts to month, COUNT, GROUP BY, ORDER BY ascending." Interviewers grade process as much as the final answer. Leave 5 minutes at the end of each problem for an edge-case sweep: empty input, NULL created_at, never-activated shops, missing utm_source, single-row partitions in window functions. The most common "almost passed" failure mode is correct happy-path code that crashes on the empty input—a 30-second sweep prevents it.
Frequently Asked Questions
What is the Shopify data engineering interview process like?
Shopify's data engineering interview opens with a recruiter screen, then a technical phone screen with one SQL coding problem, then an onsite (or virtual onsite) of four to five rounds: two to three SQL coding rounds (DATE_TRUNC rollups, JOIN + first-touch, rolling-window aggregates, regex extraction), one data-modeling discussion (sessions, shops, attribution dimensional models), and one to two behavioral rounds. Shopify does not test Python in the data engineering loop in this curated set; every coding round is SQL on merchant-session analytics.
What SQL topics does Shopify test for data engineers?
Shopify SQL interviews concentrate on five primitives that correspond directly to the curated 5-problem set: DATE_TRUNC for monthly and daily rollups (#553, daily count), MIN(timestamp) GROUP BY entity plus JOIN plus date subtraction for activation latency (#554), AVG(...) OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) for rolling averages (#555 — HARD), and regex SUBSTRING(url FROM 'pattern') for query-string extraction (#556 — MEDIUM). Memorize these five patterns and you will recognize most Shopify prompts within the first 60 seconds.
Does Shopify test Python in data engineering interviews?
The curated Shopify practice set is 5 SQL, 0 Python. Shopify's data engineering interview is SQL-only in the curated loop. If a Python question appears, treat it as a different role family (data scientist, analytics engineer with Python responsibilities) rather than the canonical data-engineering loop covered in this guide. Spending prep time on Python algorithms for a Shopify data-engineering interview is a misallocation; SQL fluency is the entire signal.
How hard are Shopify data engineering interview questions?
The Shopify set is 3 easy + 1 medium + 1 hard. The Hard-tier problem (#555 7-Day Rolling Average Sessions Per Shop) is the make-or-break round; without explicit ROWS BETWEEN 6 PRECEDING AND CURRENT ROW, candidates produce a cumulative average instead. The MEDIUM problem (#556 Extract UTM Source From Landing URL) is regex pattern-recognition once you have seen utm_source=([^&]+). The three EASY problems are DATE_TRUNC + COUNT rollups and a JOIN-with-date-difference; interviewers expect zero hesitation and clean tie-break handling.
Are window functions and rolling averages common in Shopify interviews?
Yes—#555 7-Day Rolling Average Sessions Per Shop is explicit. If you cannot fluently write AVG(sessions) OVER (PARTITION BY shop_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), you cannot pass the Shopify HARD round. Drill the window-functions practice page and the moving-average practice page until the frame clause is muscle memory.
How many Shopify practice problems should I solve before the interview?
Solve all 5 problems on the Shopify practice page end-to-end—untimed first, then timed at 25 minutes per problem. After that, broaden to 30 to 40 additional SQL problems spread across the matching topic pages: aggregation, date-functions, joins, window-functions, moving-average, regular-expressions. The Shopify SQL practice page and the Shopify easy practice page are the right surfaces for the final week of prep.


![Diagram with a landing URL containing utm_source query parameters, a regex pattern utm_source=([^&]+) highlighting the capture group, and an output cell showing the extracted source value.](https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdogvc6opnrt5p6mblmsm.jpeg)