Shopify data engineering interview questions are SQL end-to-end with a sharp merchant-sessions analytics edge: five SQL primitives (DATE_TRUNC('month', session_ts) + COUNT(*) for monthly session rollups, DATE_TRUNC('day', session_ts) + COUNT(*) for daily session counts, JOIN of shops to MIN(session_ts) GROUP BY shop_id with date-difference for activation latency, AVG(sessions) OVER (PARTITION BY shop_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) for trend-smoothed rolling traffic, and SUBSTRING(url FROM 'utm_source=([^&]+)') regex for marketing-attribution source extraction). The framings are merchant data—shop sessions, daily and monthly traffic, activation funnels, smoothed traffic trends, and UTM-tagged landing-URL analytics.
This guide walks through the five topic clusters Shopify actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 5-problem Shopify set (3 easy, 1 medium, 1 hard)—a SQL-only hub that rewards DATE_TRUNC discipline, the rolling-window frame clause, and regex fluency. There is no Python in this loop; every minute of preparation should go to SQL.
Top Shopify data engineering interview topics
From the Shopify data engineering practice set, the five numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–5) | Why it shows up at Shopify |
|---|---|---|
| 1 | SQL aggregation and DATE_TRUNC for monthly session counts | Monthly Session Count—DATE_TRUNC('month', session_ts) plus COUNT(*) plus ORDER BY month. |
| 2 | SQL DATE_TRUNC and COUNT for daily session counts | Daily Session Count—same shape with DATE_TRUNC('day', session_ts); per-shop variant adds GROUP BY shop_id, day. |
| 3 | SQL JOINs and date arithmetic for days-to-first-session per shop | Days to First Session Per Shop—MIN(session_ts) GROUP BY shop_id joined to shops.created_at with a date-difference projection. |
| 4 | SQL window functions for 7-day rolling average sessions per shop | 7-Day Rolling Average Sessions Per Shop (HARD)—AVG(sessions) OVER (PARTITION BY shop_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW). |
| 5 | SQL regex for UTM source extraction from landing URL | Extract UTM Source From Landing URL (MEDIUM)—SUBSTRING(url FROM 'utm_source=([^&]+)') with NULL fallback. |
Merchant-analytics framing rule: Shopify's prompts span pure merchant SQL—monthly traffic, daily traffic per shop, activation latency, rolling-trend smoothing, UTM attribution. The interviewer is grading whether you map each business framing to the right SQL primitive: monthly count → DATE_TRUNC('month', …) + COUNT; daily count → DATE_TRUNC('day', …) + COUNT; days-to-first → MIN(date) GROUP BY entity, JOIN to the baseline, then a date difference; rolling average → window with ROWS BETWEEN 6 PRECEDING AND CURRENT ROW; UTM extraction → regex capture group. State the mapping out loud.
1. SQL Aggregation and DATE_TRUNC for Monthly Session Counts
Date truncation and aggregation for monthly traffic in SQL for data engineering
"Build a monthly session count report" is the canonical merchant-analytics interview prompt. The mental model: DATE_TRUNC('month', session_ts) collapses every session in a calendar month to that month's first day; COUNT(*) GROUP BY <truncated date> rolls sessions up to one row per month. The same shape powers daily / weekly / quarterly rollups—swap the truncation argument and the grouping key follows.
Pro tip: Always cast or pin the truncated value to DATE (not TIMESTAMP) before grouping; mixing time zones across a TIMESTAMP WITH TIME ZONE column and a DATE_TRUNC call will silently bucket rows wrong at month boundaries. DATE_TRUNC('month', session_ts)::date is the safe form on PostgreSQL.
DATE_TRUNC('month', session_ts) collapses to month-start
DATE_TRUNC rounds a timestamp down to the nearest unit. The invariant: for any session_ts in March 2026, DATE_TRUNC('month', session_ts) returns 2026-03-01. Every March session maps to the same bucket key.
- DATE_TRUNC('month', t) — first day of the month containing t.
- DATE_TRUNC('day', t) — strips time-of-day; useful when session_ts is a TIMESTAMP.
- DATE_TRUNC('quarter', t) — first day of the quarter; pairs cleanly with quarterly traffic reports.
- DATE_TRUNC('year', t) — first day of the year; for annual rollups.
Worked example. Five sessions across two months. After DATE_TRUNC('month', session_ts):
| session_id | session_ts | month_bucket |
|---|---|---|
| 1 | 2026-03-04 09:12:00 | 2026-03-01 |
| 2 | 2026-03-19 14:27:00 | 2026-03-01 |
| 3 | 2026-03-31 23:55:00 | 2026-03-01 |
| 4 | 2026-04-02 06:01:00 | 2026-04-01 |
| 5 | 2026-04-25 18:30:00 | 2026-04-01 |
Worked-example solution.
SELECT
session_id,
session_ts,
DATE_TRUNC('month', session_ts)::date AS month_bucket
FROM sessions
ORDER BY session_ts;
Rule of thumb: whenever the prompt says "by month / by week / by quarter," reach for DATE_TRUNC before you reach for EXTRACT(MONTH FROM ...)—DATE_TRUNC keeps the year, so March 2025 and March 2026 stay distinct.
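To make the difference concrete, here is a sketch (PostgreSQL syntax; the sessions table and session_ts column follow the examples above) running both forms against data that spans two years:

```sql
-- EXTRACT collapses March 2025 and March 2026 into one bucket (month_no = 3):
SELECT EXTRACT(MONTH FROM session_ts) AS month_no, COUNT(*) AS sessions
FROM sessions
GROUP BY EXTRACT(MONTH FROM session_ts);

-- DATE_TRUNC keeps the year, so the two Marches stay distinct rows:
SELECT DATE_TRUNC('month', session_ts)::date AS month, COUNT(*) AS sessions
FROM sessions
GROUP BY DATE_TRUNC('month', session_ts)
ORDER BY month;
```

The first query is occasionally what you want (seasonality across years); for a time-series report it silently merges years, which is the trap the rule of thumb guards against.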
COUNT(*) over the month bucket
The aggregate invariant: COUNT(*) GROUP BY DATE_TRUNC('month', session_ts) produces one row per month with the total session count. Every monthly aggregate report follows this shape; only the truncation unit and aggregate function change.
- COUNT(*) — counts every row in the bucket.
- COUNT(DISTINCT visitor_id) — distinct visitors instead of total sessions.
- SUM(amount) — totals amount instead of counting rows.
- AVG(duration_s) — averages a duration metric.
Worked example. Continuing the previous five sessions, count by month.
| month_bucket | rows | COUNT(*) |
|---|---|---|
| 2026-03-01 | 3 | 3 |
| 2026-04-01 | 2 | 2 |
Worked-example solution.
SELECT
DATE_TRUNC('month', session_ts)::date AS month,
COUNT(*) AS sessions
FROM sessions
GROUP BY DATE_TRUNC('month', session_ts)
ORDER BY month;
Rule of thumb: make the GROUP BY expression match the SELECT expression exactly; engines disagree on whether GROUP BY 1 (ordinal position) or grouping by the alias is allowed, so repeating the expression is the safest portable form.
ORDER BY month for time-sorted output
The output invariant: GROUP BY does not guarantee output order. For a time-series report, append ORDER BY month to surface chronological output. Without it, downstream charting tools see arbitrary order.
- ORDER BY month — ascending chronological.
- ORDER BY month DESC — most recent first; common for ops dashboards.
- ORDER BY month + densification — for "include zero-traffic months," left-join against a generated calendar table.
- No ORDER BY after GROUP BY — flaky output across engines.
Worked example. Add a January 2026 session. Output should now span Jan, Mar, Apr—but Feb is correctly absent (no sessions).
| month | sessions |
|---|---|
| 2026-01-01 | 1 |
| 2026-03-01 | 3 |
| 2026-04-01 | 2 |
Worked-example solution.
SELECT
DATE_TRUNC('month', session_ts)::date AS month,
COUNT(*) AS sessions
FROM sessions
GROUP BY DATE_TRUNC('month', session_ts)
ORDER BY month ASC;
Rule of thumb: a time-series report without ORDER BY is a bug waiting to happen; always sort the output explicitly.
Common beginner mistakes
- Using EXTRACT(MONTH FROM session_ts) — collapses every March across years into one bucket; you lose the year.
- Forgetting the ::date cast — on TIMESTAMP WITH TIME ZONE columns, the result drifts at month boundaries.
- Aliasing the truncated column as month and then writing GROUP BY month (engine-dependent — Postgres allows it, others don't); safer to repeat the expression.
- Skipping ORDER BY — non-deterministic chronological output.
- Using COUNT(visitor_id) and being surprised by NULLs — use COUNT(*) for total rows or COUNT(DISTINCT visitor_id) for unique visitors.
SQL Interview Question on Monthly Session Count
Table sessions(session_id INT, session_ts TIMESTAMP). Return one row per month with the total count of sessions. Columns: month (first day of the month), sessions. Sort ascending by month.
Solution Using DATE_TRUNC and COUNT
SELECT
DATE_TRUNC('month', session_ts)::date AS month,
COUNT(*) AS sessions
FROM sessions
GROUP BY DATE_TRUNC('month', session_ts)
ORDER BY month ASC;
Step-by-step trace (input: 7 sessions across 3 months):
| session_id | session_ts |
|---|---|
| 1 | 2026-01-15 09:00:00 |
| 2 | 2026-03-04 11:00:00 |
| 3 | 2026-03-19 12:00:00 |
| 4 | 2026-03-31 23:50:00 |
| 5 | 2026-04-02 06:00:00 |
| 6 | 2026-04-25 18:00:00 |
| 7 | 2026-04-30 22:00:00 |
- Scan sessions — read all 7 rows.
- DATE_TRUNC('month', session_ts) — January row → 2026-01-01; March rows → 2026-03-01; April rows → 2026-04-01.
- GROUP BY DATE_TRUNC('month', session_ts) — three groups: Jan (1 row), Mar (3 rows), Apr (3 rows).
- COUNT(*) per group — Jan = 1; Mar = 3; Apr = 3.
- Project + ORDER BY month ASC — sort the three rows ascending.
Output:
| month | sessions |
|---|---|
| 2026-01-01 | 1 |
| 2026-03-01 | 3 |
| 2026-04-01 | 3 |
Why this works — concept by concept:
- DATE_TRUNC('month', session_ts) — collapses every session in a calendar month to the same first-of-month key, giving a stable group key for the aggregate.
- ::date cast — strips the time / timezone component so the bucket key is a clean DATE value, avoiding boundary drift.
- COUNT(*) — counts every row in the month bucket; the canonical aggregate for traffic rollups.
- GROUP BY matches the projected expression — safest across engines; some accept GROUP BY month (the alias) but not all.
- Sparse-by-default — months with zero sessions are absent from the output; correct unless the prompt asks for zero-row densification via a calendar table.
- ORDER BY month ASC — gives the consumer a chronological time-series; no client-side sort needed.
- Cost — one scan of sessions, one hash aggregate → O(N) time, O(M) space where M is the number of distinct months.
2. SQL DATE_TRUNC and COUNT for Daily Session Counts
Daily session bucketing and per-shop traffic in SQL for data engineering
"Daily session count" is the same DATE_TRUNC + COUNT shape as monthly, with 'day' as the truncation unit. The mental model: DATE_TRUNC('day', session_ts) collapses to the date; COUNT(*) GROUP BY day is one row per day. The two extensions Shopify reaches for: per-shop daily counts (GROUP BY shop_id, day) and densification with a calendar series (LEFT JOIN generate_series(...) to fill zero-traffic days).
Pro tip: Daily reports on small samples can give misleading conclusions; merchant-analytics teams typically pair the daily count with a 7-day rolling average (covered in §4) to separate signal from day-of-week noise. State this trade-off out loud—interviewers reward awareness of the analytical context.
Daily bucketing: DATE_TRUNC('day', session_ts)
The daily-bucket invariant: DATE_TRUNC('day', session_ts) returns the date with time stripped. Every session in a calendar day maps to the same key. On a TIMESTAMP column the result is a TIMESTAMP at midnight; cast to DATE for cleaner output.
- DATE_TRUNC('day', t) — strips time; the result is the date at 00:00.
- session_ts::date — Postgres shorthand for the same thing.
- DATE(session_ts) — engine-specific synonym (MySQL, Snowflake).
- Group key — buckets all intra-day sessions into the day's date.
Worked example. Three sessions on April 12 collapse to one bucket.
| session_ts | day_bucket |
|---|---|
| 2026-04-12 09:00:00 | 2026-04-12 |
| 2026-04-12 14:30:00 | 2026-04-12 |
| 2026-04-12 23:55:00 | 2026-04-12 |
| 2026-04-13 06:10:00 | 2026-04-13 |
Worked-example solution.
SELECT
session_ts,
DATE_TRUNC('day', session_ts)::date AS day_bucket
FROM sessions
ORDER BY session_ts;
Rule of thumb: on a pure DATE column, DATE_TRUNC is a no-op; use it when the source is a TIMESTAMP to be explicit about the truncation.
Densification with calendar generation
The densification invariant: GROUP BY day is sparse by default; days with zero sessions are absent from the output. To produce a contiguous day-by-day series, generate a calendar table and LEFT JOIN the sparse aggregate against it, replacing nulls with zero.
- generate_series(start_date, end_date, '1 day') — Postgres calendar-table generator.
- LEFT JOIN ON cal.day = agg.day — keeps every calendar row, attaching the session count where present.
- COALESCE(agg.sessions, 0) — replaces nulls with zero so charts don't break.
- Window-function alternative — for some prompts, a windowed OVER (ORDER BY day) over the sparse rows is enough.
Worked example. Three days of session data with a gap on April 13.
| day | sessions (sparse) |
|---|---|
| 2026-04-12 | 3 |
| 2026-04-14 | 5 |
After densification with a calendar:
| day | sessions |
|---|---|
| 2026-04-12 | 3 |
| 2026-04-13 | 0 |
| 2026-04-14 | 5 |
Worked-example solution.
WITH cal AS (
SELECT generate_series('2026-04-12'::date, '2026-04-14'::date, '1 day')::date AS day
),
agg AS (
SELECT DATE_TRUNC('day', session_ts)::date AS day, COUNT(*) AS sessions
FROM sessions
GROUP BY DATE_TRUNC('day', session_ts)
)
SELECT cal.day, COALESCE(agg.sessions, 0) AS sessions
FROM cal
LEFT JOIN agg ON agg.day = cal.day
ORDER BY cal.day;
Rule of thumb: densify when downstream charts need contiguous days; leave sparse when the consumer just wants "days with traffic."
Per-shop daily counts: GROUP BY shop_id, day
The per-shop invariant: add shop_id to the GROUP BY to produce one row per (shop, day). The output is a 2D matrix in long form—cell (shop, day) = sessions. This is the canonical input shape for §4's rolling-window query.
- GROUP BY shop_id, day — composite key; one row per cell.
- Output shape — long form (one row per cell), not wide.
- Densification — cross-join calendar × shops, then left-join the aggregate.
- Tie-break order — ORDER BY shop_id, day for deterministic output.
Worked example. Two shops × three days of sessions.
| shop_id | day | sessions |
|---|---|---|
| 1 | 2026-04-12 | 3 |
| 1 | 2026-04-13 | 5 |
| 2 | 2026-04-12 | 1 |
| 2 | 2026-04-14 | 2 |
Worked-example solution.
SELECT
shop_id,
DATE_TRUNC('day', session_ts)::date AS day,
COUNT(*) AS sessions
FROM sessions
GROUP BY shop_id, DATE_TRUNC('day', session_ts)
ORDER BY shop_id, day;
Rule of thumb: the long-form (shop, day, sessions) triple is the right input format for any window-function rollup over per-shop time series.
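The per-shop densification mentioned above (cross-join calendar × shops, then left-join the aggregate) can be sketched as follows. This is a PostgreSQL sketch; the shops table and the date bounds are assumptions for illustration:

```sql
-- Every (shop, day) cell appears, with 0 for zero-traffic days.
WITH cal AS (
  SELECT generate_series('2026-04-12'::date, '2026-04-14'::date, '1 day')::date AS day
),
agg AS (
  SELECT shop_id, DATE_TRUNC('day', session_ts)::date AS day, COUNT(*) AS sessions
  FROM sessions
  GROUP BY shop_id, DATE_TRUNC('day', session_ts)
)
SELECT s.id AS shop_id, cal.day, COALESCE(agg.sessions, 0) AS sessions
FROM shops s
CROSS JOIN cal                                        -- one row per (shop, day) cell
LEFT JOIN agg ON agg.shop_id = s.id AND agg.day = cal.day
ORDER BY s.id, cal.day;
```

The CROSS JOIN builds the full matrix first; the LEFT JOIN then attaches counts only where sessions exist.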
Common beginner mistakes
- Using session_ts directly in GROUP BY (without truncation) — every session goes to its own group, no aggregation.
- Counting COUNT(session_id) instead of COUNT(*) — usually identical, but COUNT(session_id) skips NULL ids.
- Forgetting to densify and surprising the consumer with a chart full of gaps.
- GROUP BY 1, 2 — works on Postgres but breaks on strict-mode engines; repeat the expressions to be safe.
- Mixing DATE_TRUNC('day', …) and a separate EXTRACT(DAY FROM …) — different output types; pick one.
SQL Interview Question on Daily Session Count Per Shop
Table sessions(session_id INT, shop_id INT, session_ts TIMESTAMP). Return one row per (shop_id, day) with the count of sessions. Columns: shop_id, day, sessions. Sort ascending by shop_id, then by day.
Solution Using DATE_TRUNC('day', ...) and a composite GROUP BY
SELECT
shop_id,
DATE_TRUNC('day', session_ts)::date AS day,
COUNT(*) AS sessions
FROM sessions
GROUP BY shop_id, DATE_TRUNC('day', session_ts)
ORDER BY shop_id ASC, day ASC;
Step-by-step trace (input: 6 rows across 2 shops × 3 days):
| session_id | shop_id | session_ts |
|---|---|---|
| 1 | 1 | 2026-04-12 09:00:00 |
| 2 | 1 | 2026-04-12 14:00:00 |
| 3 | 1 | 2026-04-13 11:00:00 |
| 4 | 2 | 2026-04-12 16:00:00 |
| 5 | 2 | 2026-04-14 08:30:00 |
| 6 | 2 | 2026-04-14 22:00:00 |
- Scan sessions — read all 6 rows.
- Truncate — 2026-04-12, 2026-04-13, 2026-04-14 per row.
- GROUP BY shop_id, day — four groups: (1, 2026-04-12), (1, 2026-04-13), (2, 2026-04-12), (2, 2026-04-14).
- COUNT(*) — (1, 2026-04-12) = 2, (1, 2026-04-13) = 1, (2, 2026-04-12) = 1, (2, 2026-04-14) = 2.
- ORDER BY shop_id, day — numeric shop order, then chronological days.
Output:
| shop_id | day | sessions |
|---|---|---|
| 1 | 2026-04-12 | 2 |
| 1 | 2026-04-13 | 1 |
| 2 | 2026-04-12 | 1 |
| 2 | 2026-04-14 | 2 |
Why this works — concept by concept:
- DATE_TRUNC('day', session_ts) — collapses every intra-day timestamp to the calendar date; the canonical day key.
- Composite GROUP BY (shop_id, day) — produces one row per (shop, day) cell, exactly what merchant-analytics dashboards consume.
- COUNT(*) — total session count per cell; the simplest possible traffic metric.
- Long-form output — one row per cell rather than a wide pivot keeps the result compatible with downstream tools that prefer long format.
- ORDER BY shop_id, day — deterministic output; charting libraries can render directly without a sort step.
- Sparse-by-default — days with no sessions for a shop are absent; densify with a calendar cross-join only if the consumer needs zero rows.
- Cost — one scan of sessions, one hash aggregate over the composite key → O(N) time, O(S × D) space where S is shops and D is distinct days.
3. SQL JOINs and Date Arithmetic for Days-to-First-Session Per Shop
Activation latency via JOIN and date difference in SQL for data engineering
"For each shop, how many days from created_at until the first session?" is the canonical activation-latency interview prompt. The mental model: MIN(session_ts) GROUP BY shop_id gives the first-touch per shop; JOIN to shops.created_at to anchor the baseline; subtract dates to produce the latency in days. This is the bread-and-butter pattern for any "time-to-first-X" question—time-to-first-purchase, time-to-first-payout, time-to-first-active-user.
Pro tip: Use a LEFT JOIN from shops to the first-session CTE—not the other way round. Some shops never had a session; you want them in the output with NULL (or "never activated") rather than dropped silently. The interviewer will probe the missing-side case if you skip it.
MIN(session_ts) GROUP BY shop_id for first-touch
The first-touch invariant: MIN(session_ts) GROUP BY shop_id collapses each shop to one row whose timestamp is the shop's first session. Repeated sessions by the same shop contribute zero to "first" because we keep only the minimum.
- MIN(session_ts) — earliest session per shop.
- GROUP BY shop_id — one row per shop.
- Optional bound — WHERE session_ts >= '2026-04-01' if the report is windowed.
- Cast — MIN(session_ts)::date if you want the date only (no time component).
Worked example. Five sessions across three shops.
| session_id | shop_id | session_ts |
|---|---|---|
| 1 | 1 | 2026-04-12 09:00:00 |
| 2 | 1 | 2026-04-15 12:00:00 |
| 3 | 2 | 2026-04-13 14:00:00 |
| 4 | 3 | 2026-04-20 16:00:00 |
| 5 | 3 | 2026-04-22 18:00:00 |
After MIN(session_ts) GROUP BY shop_id:
| shop_id | first_session_ts |
|---|---|
| 1 | 2026-04-12 09:00:00 |
| 2 | 2026-04-13 14:00:00 |
| 3 | 2026-04-20 16:00:00 |
Worked-example solution.
SELECT
shop_id,
MIN(session_ts) AS first_session_ts
FROM sessions
GROUP BY shop_id;
Rule of thumb: "first time a thing happened per entity" → MIN(timestamp) GROUP BY entity_id.
JOIN to shops.created_at to anchor the baseline
The JOIN invariant: the activation baseline lives on the shops table; the first session lives on sessions. JOIN them on shop_id to bring both timestamps into one row. Use LEFT JOIN from shops so shops that never had a session still appear in the output.
- LEFT JOIN first_touch ON first_touch.shop_id = shops.id — keeps every shop.
- shops.created_at — the activation baseline.
- first_touch.first_session_ts — the first-session timestamp (NULL if never activated).
- first_session_ts::date - shops.created_at::date — latency in days; the subtraction propagates NULL, so never-activated shops stay NULL without a special case.
Worked example. Three shops with created_at; only the first two had a session.
| shop_id | created_at | first_session_ts |
|---|---|---|
| 1 | 2026-04-10 | 2026-04-12 09:00:00 |
| 2 | 2026-04-11 | 2026-04-13 14:00:00 |
| 3 | 2026-04-15 | NULL |
Worked-example solution.
WITH first_touch AS (
SELECT shop_id, MIN(session_ts) AS first_session_ts
FROM sessions
GROUP BY shop_id
)
SELECT
s.id AS shop_id,
s.created_at,
ft.first_session_ts
FROM shops s
LEFT JOIN first_touch ft ON ft.shop_id = s.id
ORDER BY s.id;
Rule of thumb: the JOIN brings the baseline onto the metric row; in activation funnels, the baseline is always the entity's creation time.
Date difference: MIN(session_ts)::date - shops.created_at::date
The difference invariant: (date_a - date_b) returns an integer count of days (in PostgreSQL; engine-specific elsewhere—DATEDIFF on Snowflake/MySQL). Cast both to DATE to get day granularity; leave as TIMESTAMP for sub-day precision.
- (d1 - d2) — Postgres day difference between DATE values (returns an integer).
- DATEDIFF('day', d2, d1) — Snowflake / Databricks.
- DATEDIFF(d1, d2) — MySQL.
- EXTRACT(EPOCH FROM (t1 - t2)) / 86400.0 — Postgres for fractional days.
Worked example. Continuing the previous CTE.
| shop_id | created_at | first_session_ts | days_to_first_session |
|---|---|---|---|
| 1 | 2026-04-10 | 2026-04-12 | 2 |
| 2 | 2026-04-11 | 2026-04-13 | 2 |
| 3 | 2026-04-15 | NULL | NULL |
Worked-example solution.
WITH first_touch AS (
SELECT shop_id, MIN(session_ts)::date AS first_day
FROM sessions
GROUP BY shop_id
)
SELECT
s.id AS shop_id,
s.created_at::date AS created_day,
ft.first_day,
(ft.first_day - s.created_at::date) AS days_to_first_session
FROM shops s
LEFT JOIN first_touch ft ON ft.shop_id = s.id
ORDER BY s.id;
Rule of thumb: always cast both sides to DATE before subtracting; TIMESTAMP - TIMESTAMP returns an INTERVAL, not an integer.
Common beginner mistakes
- Using INNER JOIN and dropping never-activated shops — use LEFT JOIN to keep them with NULL.
- Forgetting to cast both sides to DATE — TIMESTAMP - TIMESTAMP returns an INTERVAL.
- Computing days = (first - created) / 86400.0 from epoch seconds — correct but verbose; use the date-subtraction form.
- GROUP BY shop_id, created_at without a CTE — mixes aggregation and per-row selection in confusing ways.
- Treating MIN(session_ts) < created_at as impossible — data-quality issues happen; consider whether to filter or surface them.
SQL Interview Question on Days to First Session Per Shop
Tables shops(id INT, created_at TIMESTAMP) and sessions(session_id INT, shop_id INT, session_ts TIMESTAMP). Return one row per shop with shop_id, created_at (date), and days_to_first_session (integer) — the day count between created_at and the shop's first session. For shops with no session, return NULL for days_to_first_session. Sort ascending by shop_id.
Solution Using a first-touch CTE and a LEFT JOIN with date subtraction
WITH first_touch AS (
SELECT shop_id, MIN(session_ts)::date AS first_day
FROM sessions
GROUP BY shop_id
)
SELECT
s.id AS shop_id,
s.created_at::date AS created_at,
(ft.first_day - s.created_at::date) AS days_to_first_session
FROM shops s
LEFT JOIN first_touch ft ON ft.shop_id = s.id
ORDER BY s.id ASC;
Step-by-step trace (input: 4 shops, 5 sessions; shop 4 never activated):
| shops: id | created_at |
|---|---|
| 1 | 2026-04-10 08:00:00 |
| 2 | 2026-04-11 09:00:00 |
| 3 | 2026-04-12 10:00:00 |
| 4 | 2026-04-15 11:00:00 |
| sessions: session_id | shop_id | session_ts |
|---|---|---|
| 1 | 1 | 2026-04-12 14:00:00 |
| 2 | 1 | 2026-04-15 09:00:00 |
| 3 | 2 | 2026-04-13 16:00:00 |
| 4 | 3 | 2026-04-12 11:00:00 |
| 5 | 3 | 2026-04-14 12:00:00 |
- Build first_touch — MIN(session_ts)::date GROUP BY shop_id:
  - Shop 1 → 2026-04-12
  - Shop 2 → 2026-04-13
  - Shop 3 → 2026-04-12
  - Shop 4 → no row (never activated).
- LEFT JOIN — shops (4 rows) joined to first_touch (3 rows). Shop 4's first_day is NULL.
- Date subtraction —
  - Shop 1: 2026-04-12 - 2026-04-10 = 2
  - Shop 2: 2026-04-13 - 2026-04-11 = 2
  - Shop 3: 2026-04-12 - 2026-04-12 = 0
  - Shop 4: NULL - 2026-04-15 = NULL
- ORDER BY shop_id ASC — final ascending sort.
Output:
| shop_id | created_at | days_to_first_session |
|---|---|---|
| 1 | 2026-04-10 | 2 |
| 2 | 2026-04-11 | 2 |
| 3 | 2026-04-12 | 0 |
| 4 | 2026-04-15 | NULL |
Why this works — concept by concept:
- MIN(session_ts) GROUP BY shop_id — collapses each shop into one row whose date is its first interaction; the canonical "first" definition.
- ::date cast on MIN — strips the time component so the subtraction returns an integer day count, not an interval.
- LEFT JOIN from shops — keeps every shop in the output; shops that never appeared in sessions show NULL for first_day and therefore NULL for the latency, which is exactly the desired semantics.
- Date subtraction (first_day - created_at::date) — Postgres returns an integer day count; use the engine-specific DATEDIFF on other dialects.
- NULL-propagating arithmetic — NULL - date = NULL automatically, so never-activated shops don't need a special-case branch.
- ORDER BY shop_id ASC — deterministic output regardless of physical row order.
- Cost — one scan of sessions for the CTE, one hash join plus one scan of shops, one scalar projection → O(N + M) time where N = sessions and M = shops.
4. SQL Window Functions for 7-Day Rolling Average Sessions Per Shop
Frame-clause window functions for trailing rolling averages in SQL for data engineering
"For each shop, compute a 7-day trailing rolling average of daily sessions" is the Hard-tier interview prompt for merchant analytics. The mental model: AVG(sessions) OVER (PARTITION BY shop_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) averages the current day's sessions plus the prior six days into a single value—repeated for every (shop, day) cell. The frame clause ROWS BETWEEN 6 PRECEDING AND CURRENT ROW is the entire trick; without it, the default frame includes everything from partition start to current row, which is a cumulative average, not a 7-day rolling.
Pro tip: The default window frame for an ORDER BY window is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW—a cumulative window, not a rolling one. Always state the frame explicitly: ROWS BETWEEN N PRECEDING AND CURRENT ROW for trailing N+1; ROWS BETWEEN N PRECEDING AND N FOLLOWING for centered. Default frames bite candidates who think OVER (ORDER BY day) alone gives a rolling average.
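A minimal sketch of the trap, assuming a daily_sessions(shop_id, day, sessions) table with one row per (shop, day). Computing both frames side by side makes the divergence visible once a partition has more than seven rows:

```sql
SELECT
  shop_id, day, sessions,
  -- Default frame: cumulative average from partition start to current row.
  AVG(sessions) OVER (PARTITION BY shop_id ORDER BY day) AS cumulative_avg,
  -- Explicit frame: trailing 7-row rolling average.
  AVG(sessions) OVER (
    PARTITION BY shop_id
    ORDER BY day
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS trailing_7_avg
FROM daily_sessions
ORDER BY shop_id, day;
```

The two columns agree for the first seven days of each shop, then drift apart as the cumulative frame keeps growing while the rolling frame stays at seven rows.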
PARTITION BY shop_id ORDER BY day for per-shop time windows
The window-partition invariant: PARTITION BY shop_id slices the row set into independent per-shop windows; ORDER BY day defines what "previous" means within each partition. The rolling average never crosses shops because the partition isolates each.
- PARTITION BY shop_id — independent per-shop frames.
- ORDER BY day — chronological order within the partition.
- No PARTITION BY — the window spans the whole table (one global rolling average).
- ORDER BY day ASC — explicit ascending; DESC would invert what "preceding" means.
Worked example. Two shops × 7 days each. Partition splits the 14 rows into two independent windows.
| shop_id | day | sessions |
|---|---|---|
| 1 | 2026-04-08 | 50 |
| 1 | 2026-04-09 | 60 |
| 1 | 2026-04-10 | 55 |
| ... | ... | ... |
| 2 | 2026-04-08 | 30 |
| 2 | 2026-04-09 | 40 |
| ... | ... | ... |
Worked-example solution.
SELECT
shop_id, day, sessions,
COUNT(*) OVER (PARTITION BY shop_id ORDER BY day) AS rownum_so_far
FROM daily_sessions
ORDER BY shop_id, day;
Rule of thumb: if the prompt says "per shop" or "per user" with a time series, PARTITION BY entity ORDER BY time is the correct skeleton.
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW frame clause
The frame invariant: ROWS BETWEEN N PRECEDING AND CURRENT ROW includes the current row and the N rows physically before it in the partition's order. For a 7-day trailing window, N = 6 (current + 6 preceding = 7 total).
- ROWS BETWEEN 6 PRECEDING AND CURRENT ROW — the last 7 rows.
- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW — cumulative.
- ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING — a 7-row centered window.
- RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW — 7 calendar days regardless of gaps.
Worked example. 10 daily session counts for shop 1; show the trailing-7 frame at three different days.
| day | sessions | frame at this day | window contents |
|---|---|---|---|
| 04-08 | 50 | 1 row | {50} |
| 04-09 | 60 | 2 rows | {50, 60} |
| 04-14 | 75 | 7 rows | {50, 60, 55, 70, 65, 80, 75} |
Worked-example solution.
SELECT
shop_id, day, sessions,
COUNT(*) OVER (
PARTITION BY shop_id
ORDER BY day
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS frame_size
FROM daily_sessions
ORDER BY shop_id, day;
-- frame_size grows from 1 (day 1) up to 7 (day 7+) and stays at 7 thereafter.
Rule of thumb: ROWS counts physical rows; RANGE counts logical values (good for date-aware windows on irregular series). Pick ROWS for "last 7 daily rows," RANGE for "last 7 calendar days regardless of gaps."
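The RANGE form from the rule of thumb might look like this on the same assumed daily_sessions table (PostgreSQL 11+ supports RANGE with an INTERVAL offset):

```sql
-- Calendar-aware trailing window: "last 7 calendar days", not "last 7 rows".
SELECT
  shop_id, day, sessions,
  AVG(sessions) OVER (
    PARTITION BY shop_id
    ORDER BY day
    RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
  ) AS trailing_7_calendar_avg
FROM daily_sessions
ORDER BY shop_id, day;
```

On a gappy series the two forms differ: ROWS averages the last seven rows however far apart their dates are, while RANGE averages only the rows whose day falls inside the 7-calendar-day window (missing days simply contribute nothing).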
AVG(sessions) over the bounded frame for trailing 7-day average
The aggregate-over-frame invariant: placing AVG(sessions) in front of the OVER (...) clause computes the average over the frame for each row. Repeat for every (shop, day) cell; the engine slides the frame by one row at a time.
- AVG(sessions) OVER (...) — rolling average.
- SUM(sessions) OVER (...) — rolling sum.
- MAX(sessions) OVER (...) — rolling max.
- COUNT(*) OVER (...) — frame size; 1, 2, ..., 7, 7, 7 for a ROWS BETWEEN 6 PRECEDING AND CURRENT ROW frame.
Worked example. Shop 1, 10 daily session counts.
| day | sessions | trailing_7_avg |
|---|---|---|
| 04-08 | 50 | 50.00 (1 row) |
| 04-09 | 60 | 55.00 (2 rows) |
| 04-10 | 55 | 55.00 (3 rows) |
| 04-11 | 70 | 58.75 (4 rows) |
| 04-12 | 65 | 60.00 (5 rows) |
| 04-13 | 80 | 63.33 (6 rows) |
| 04-14 | 75 | 65.00 (7 rows) |
| 04-15 | 90 | 70.71 (last 7) |
Worked-example solution.
SELECT
shop_id,
day,
sessions,
ROUND(
AVG(sessions) OVER (
PARTITION BY shop_id
ORDER BY day
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
),
2
) AS trailing_7_avg
FROM daily_sessions
ORDER BY shop_id, day;
Rule of thumb: always ROUND(..., 2) the rolling average for readable output; trailing decimals look noisy.
Common beginner mistakes
- Omitting the frame clause and getting a cumulative average instead of a rolling one.
- Using RANGE BETWEEN 6 PRECEDING AND CURRENT ROW on an integer column — works, but on a date column it counts logical values, not rows; pick deliberately.
- Forgetting PARTITION BY shop_id — the average bleeds across shops.
- Computing the rolling average from raw sessions rows rather than from a pre-aggregated daily CTE — correct only if there is exactly one row per (shop, day).
- ORDER BY day DESC inside the window — inverts what "preceding" means; the rolling average becomes a leading-7 instead of trailing.
SQL Interview Question on 7-Day Rolling Average Sessions Per Shop
Table daily_sessions(shop_id INT, day DATE, sessions INT) with one row per (shop_id, day). Return shop_id, day, sessions, and trailing_7_avg — the rolling average of daily sessions over the current day plus the 6 preceding days, per shop. Round trailing_7_avg to 2 decimals. Sort ascending by shop_id, then by day.
Solution Using AVG(...) OVER (PARTITION BY shop_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
SELECT
shop_id,
day,
sessions,
ROUND(
AVG(sessions) OVER (
PARTITION BY shop_id
ORDER BY day
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
)::numeric,
2
) AS trailing_7_avg
FROM daily_sessions
ORDER BY shop_id ASC, day ASC;
Step-by-step trace (input: 1 shop × 8 days):
| shop_id | day | sessions |
|---|---|---|
| 1 | 2026-04-08 | 50 |
| 1 | 2026-04-09 | 60 |
| 1 | 2026-04-10 | 55 |
| 1 | 2026-04-11 | 70 |
| 1 | 2026-04-12 | 65 |
| 1 | 2026-04-13 | 80 |
| 1 | 2026-04-14 | 75 |
| 1 | 2026-04-15 | 90 |
- Partition — only one shop, so one partition with 8 rows.
-
Order — by
dayascending. -
Frame —
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW; the frame has 1 row on day 1, grows to 7 by day 7, stays at 7 for day 8. -
AVG(sessions)per row —- Day 1: AVG of {50} = 50.00.
- Day 2: AVG of {50, 60} = 55.00.
- Day 3: AVG of {50, 60, 55} = 55.00.
- Day 4: AVG of {50, 60, 55, 70} = 58.75.
- Day 5: AVG of {50, 60, 55, 70, 65} = 60.00.
- Day 6: AVG of {50, 60, 55, 70, 65, 80} = 63.33.
- Day 7: AVG of {50, 60, 55, 70, 65, 80, 75} = 65.00.
- Day 8: AVG of {60, 55, 70, 65, 80, 75, 90} = 70.71 (the day-1 value drops out as the window slides).
- `ROUND(..., 2)` — rounds to 2 decimals.
Output:
| shop_id | day | sessions | trailing_7_avg |
|---|---|---|---|
| 1 | 2026-04-08 | 50 | 50.00 |
| 1 | 2026-04-09 | 60 | 55.00 |
| 1 | 2026-04-10 | 55 | 55.00 |
| 1 | 2026-04-11 | 70 | 58.75 |
| 1 | 2026-04-12 | 65 | 60.00 |
| 1 | 2026-04-13 | 80 | 63.33 |
| 1 | 2026-04-14 | 75 | 65.00 |
| 1 | 2026-04-15 | 90 | 70.71 |
Why this works — concept by concept:
- `PARTITION BY shop_id` — slices the row set into per-shop windows so the rolling average is computed independently per shop and never bleeds across shops.
- `ORDER BY day` — defines the chronological order that "preceding" depends on; without it, "preceding" is undefined.
- `ROWS BETWEEN 6 PRECEDING AND CURRENT ROW` — the explicit frame clause; bounds the average to the current row plus the six rows before it for a 7-row trailing window.
- `AVG(sessions) OVER (...)` — the aggregate-over-frame; the engine slides the frame by one row at each cell and averages the frame's `sessions` values.
- Ramp-up at partition start — for the first six days per shop, the frame is shorter than 7 rows; the average uses whatever's available, which is the standard rolling-average convention (avoid silently emitting NULL just because there isn't a full 7-day history).
- `ROUND(..., 2)` — rounds the floating-point output to two decimals for readable charting.
- Cost — one scan of `daily_sessions` and one window pass with a frame buffer of size 7 → `O(N log N)` worst-case if the engine sorts inside the window; typically `O(N)` with a hash-partition operator.
5. SQL Regex for UTM Source Extraction From Landing URL
Regex extraction from query strings in SQL for data engineering
"Extract the utm_source value from a landing URL" is the canonical regex interview prompt for marketing-attribution work. The mental model: the landing URL has a query string ?utm_source=google&utm_medium=cpc&utm_campaign=spring; the regex pattern utm_source=([^&]+) matches the literal utm_source= then captures everything until the next & or end of string. The SUBSTRING(url FROM 'pattern') form returns the capture group; missing parameters return NULL cleanly.
Pro tip: Different engines spell this differently. Postgres: `SUBSTRING(url FROM 'utm_source=([^&]+)')`. Snowflake: `REGEXP_SUBSTR(url, 'utm_source=([^&]+)', 1, 1, 'e', 1)`. MySQL: `REGEXP_SUBSTR(url, 'utm_source=[^&]+')` then `SUBSTRING_INDEX`. Ask the interviewer which dialect to target before writing the query.
Regex pattern: utm_source=([^&]+)
The regex invariant: utm_source= is a literal anchor; ([^&]+) captures one or more characters that are not &. The capture group (...) is what gets extracted; the outer literal anchors the match to the right query parameter.
- `utm_source=` — literal; matches that exact substring.
- `(...)` capture group — the substring to extract.
- `[^&]+` — one or more characters that are not `&` (the query-string delimiter).
- `+` quantifier — one or more; use `*` for zero-or-more (matches `utm_source=` with no value).
Worked example. Three landing URLs.
| url | match | capture |
|---|---|---|
| /landing?utm_source=google&utm_medium=cpc | utm_source=google | google |
| /landing?ref=email&utm_source=fb | utm_source=fb | fb |
| /landing?utm_medium=cpc | (no match) | NULL |
Worked-example solution.
```sql
-- Pattern walkthrough
SELECT
  '/landing?utm_source=google&utm_medium=cpc' AS url,
  SUBSTRING('/landing?utm_source=google&utm_medium=cpc' FROM 'utm_source=([^&]+)') AS utm_source;
-- utm_source = 'google'
```
Rule of thumb: always anchor the literal (here utm_source=); a bare [^&]+ matches the very first param value, not the right one.
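The anchoring rule is quick to verify with Python's `re` module standing in for the SQL engine's regex (the character class and quantifier behave the same way):

```python
import re

url = "/landing?ref=email&utm_source=fb"

# Anchored: the literal pins the match to the utm_source parameter.
anchored = re.search(r"utm_source=([^&]+)", url)
print(anchored.group(1))  # fb

# Unanchored: a bare value pattern grabs the first parameter's value instead.
bare = re.search(r"=([^&]+)", url)
print(bare.group(1))  # email
```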
Capture group extraction with SUBSTRING(url FROM 'pattern')
The extraction invariant: `SUBSTRING(string FROM regex)` returns the matching capture group, or the entire match if no capture group is given. The Postgres form is the cleanest of the dialect variants.
- `SUBSTRING(url FROM 'utm_source=([^&]+)')` — Postgres; returns the capture group.
- `REGEXP_SUBSTR(url, 'utm_source=[^&]+')` — Snowflake / Oracle; returns the entire match (would include `utm_source=`).
- `REGEXP_REPLACE(url, '.*utm_source=([^&]+).*', '\1')` — alternative; replaces the whole string with the capture group.
- `split_part(split_part(url, 'utm_source=', 2), '&', 1)` — non-regex alternative; cleaner on engines without strong regex support.
Worked example. Apply SUBSTRING ... FROM to four URLs.
| url | extracted utm_source |
|---|---|
| /landing?utm_source=google&utm_medium=cpc | google |
| /landing?ref=email&utm_source=fb | fb |
| /landing?utm_medium=cpc | NULL |
| /landing | NULL |
Worked-example solution.
```sql
SELECT
  url,
  SUBSTRING(url FROM 'utm_source=([^&]+)') AS utm_source
FROM landings;
```
Rule of thumb: SUBSTRING(... FROM 'pattern') is the cleanest form; if you need cross-engine portability, fall back to split_part chained calls.
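The `split_part` chain maps directly onto plain string splitting. A sketch of the same two-step split in Python (the helper name is made up for illustration; note that Postgres `split_part` yields an empty string, not NULL, when the literal is absent):

```python
def utm_source_split(url: str):
    # Mirrors split_part(split_part(url, 'utm_source=', 2), '&', 1):
    # step 1 — everything after the literal 'utm_source=';
    # step 2 — everything before the next '&'.
    # We return None for the absent case to mimic the regex form's NULL.
    parts = url.split("utm_source=", 1)
    if len(parts) < 2:
        return None
    return parts[1].split("&", 1)[0]

print(utm_source_split("/landing?utm_source=google&utm_medium=cpc"))  # google
print(utm_source_split("/landing?utm_medium=cpc"))                    # None
```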
NULL fallback when the parameter is absent
The fallback invariant: regex extraction returns NULL when the pattern doesn't match; downstream consumers usually want a labeled fallback like 'unknown'. Wrap with COALESCE(extracted, 'unknown') to surface a non-NULL value when the parameter is missing.
- `COALESCE(extracted, 'unknown')` — labeled fallback.
- `COALESCE(extracted, 'direct')` — common attribution convention; "no UTM = direct traffic."
- `CASE WHEN extracted IS NULL THEN 'direct' ELSE extracted END` — equivalent but verbose.
- `NULLIF(extracted, '')` — defensive; treats the empty string as NULL too.
Worked example. Apply COALESCE to the previous result.
| url | utm_source | with_fallback |
|---|---|---|
| /landing?utm_source=google&utm_medium=cpc | google | google |
| /landing?ref=email&utm_source=fb | fb | fb |
| /landing?utm_medium=cpc | NULL | direct |
| /landing | NULL | direct |
Worked-example solution.
```sql
SELECT
  url,
  COALESCE(
    SUBSTRING(url FROM 'utm_source=([^&]+)'),
    'direct'
  ) AS utm_source
FROM landings;
```
Rule of thumb: the convention in marketing analytics is 'direct' for missing UTM, not NULL — surface it consistently so downstream attribution reports have a single category for "no source."
Common beginner mistakes
- Using `LIKE '%utm_source=%'` instead of regex—matches presence but doesn't extract the value.
- Forgetting the capture group `(...)` and getting back `utm_source=google` instead of `google`.
- Using `[^&]*` (zero-or-more) and matching an empty string when the parameter has no value—use `[^&]+` (one-or-more) unless you specifically want to match empty values.
- Skipping the COALESCE-to-`'direct'` step—downstream attribution reports show NULL as a separate category.
- Anchoring with `^utm_source=` and missing matches when `utm_source` isn't the first parameter.
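A few of these pitfalls are easy to reproduce with Python's `re` as a stand-in for the SQL regex engine:

```python
import re

empty_value = "/landing?utm_source=&utm_medium=cpc"

# [^&]+ needs at least one character, so an empty value is no match at all.
plus = re.search(r"utm_source=([^&]+)", empty_value)
print(plus)  # None

# [^&]* happily captures the empty string instead.
star = re.search(r"utm_source=([^&]*)", empty_value)
print(repr(star.group(1)))  # ''

# ^utm_source= misses URLs where utm_source isn't the first parameter.
late = "/landing?ref=email&utm_source=fb"
print(re.search(r"^utm_source=([^&]+)", late))  # None
```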
SQL Interview Question on UTM Source Extraction
Table landings(landing_id INT, url TEXT). Return one row per landing with landing_id, url, and utm_source — the extracted value of the utm_source query parameter. If the URL has no utm_source, return the literal string 'direct'. Sort ascending by landing_id.
Solution Using SUBSTRING(url FROM 'utm_source=([^&]+)') with COALESCE
```sql
SELECT
  landing_id,
  url,
  COALESCE(
    SUBSTRING(url FROM 'utm_source=([^&]+)'),
    'direct'
  ) AS utm_source
FROM landings
ORDER BY landing_id ASC;
```
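SQLite has no `SUBSTRING ... FROM` regex form, so here is a pure-Python re-run of the same extraction-plus-fallback logic over the four trace rows (`COALESCE` becomes a conditional fallback):

```python
import re

# The four trace rows from the problem input.
landings = [
    (1, "/landing?utm_source=google&utm_medium=cpc&utm_campaign=spring"),
    (2, "/landing?ref=email&utm_source=facebook&utm_medium=social"),
    (3, "/landing?utm_medium=cpc"),
    (4, "/landing"),
]

def utm_source(url: str) -> str:
    # SUBSTRING(url FROM 'utm_source=([^&]+)') -> capture group or None,
    # then COALESCE(..., 'direct') as the labeled fallback.
    m = re.search(r"utm_source=([^&]+)", url)
    return m.group(1) if m else "direct"

for landing_id, url in sorted(landings):  # ORDER BY landing_id ASC
    print(landing_id, utm_source(url))
# 1 google / 2 facebook / 3 direct / 4 direct
```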
Step-by-step trace (input: 4 landings):
| landing_id | url |
|---|---|
| 1 | /landing?utm_source=google&utm_medium=cpc&utm_campaign=spring |
| 2 | /landing?ref=email&utm_source=facebook&utm_medium=social |
| 3 | /landing?utm_medium=cpc |
| 4 | /landing |
- Row 1 — pattern `utm_source=([^&]+)` matches `utm_source=google`; the capture group is `google`. `COALESCE('google', 'direct')` returns `google`.
- Row 2 — pattern matches `utm_source=facebook`; the capture group is `facebook`. `COALESCE('facebook', 'direct')` returns `facebook`.
- Row 3 — pattern doesn't match (`utm_source` not present); `SUBSTRING ... FROM` returns NULL. `COALESCE(NULL, 'direct')` returns `'direct'`.
- Row 4 — pattern doesn't match; same as row 3, returns `'direct'`.
- `ORDER BY landing_id ASC` — final ascending order.
Output:
| landing_id | url | utm_source |
|---|---|---|
| 1 | /landing?utm_source=google&utm_medium=cpc&utm_campaign=spring | google |
| 2 | /landing?ref=email&utm_source=facebook&utm_medium=social | facebook |
| 3 | /landing?utm_medium=cpc | direct |
| 4 | /landing | direct |
Why this works — concept by concept:
- Literal anchor `utm_source=` — pins the regex to the right parameter; `[^&]+` alone would match any first-encountered query value.
- Capture group `([^&]+)` — selects only the value (everything after `utm_source=` up to the next `&` or end of string), excluding the anchor literal.
- `[^&]` character class — "any character except ampersand," the only delimiter that ends a query-parameter value.
- `+` quantifier — one or more; ignores empty `utm_source=` cases (use `*` if you want them).
- `SUBSTRING(... FROM 'pattern')` — Postgres returns the capture group when the pattern has one; cleaner than `REGEXP_SUBSTR` variants that return the entire match.
- `COALESCE(extracted, 'direct')` — labeled fallback for the no-UTM case; downstream attribution reports get a single bucket for missing-source instead of NULL leakage.
- `ORDER BY landing_id ASC` — deterministic output regardless of physical row order.
- Cost — one scan of `landings` and one regex evaluation per row → `O(N · L)` time where `L` is average URL length (regex matching is linear in input length).
Tips to Crack Shopify Data Engineering Interviews
SQL is the entire loop—skip Python prep
Every problem in the curated Shopify SQL practice set is SQL. There is no Python in this loop. Spending interview-prep time on Python algorithms is a misallocation; spend it on DATE_TRUNC, the rolling-window frame clause, regex extraction, and MIN(date) GROUP BY entity first-touch CTEs. Expect 3–5 SQL coding problems across the loop, all framed as merchant-session analytics.
Drill the rolling-window frame clause until automatic
The Hard-tier #555 7-Day Rolling Average problem hinges on one structural choice: writing ROWS BETWEEN 6 PRECEDING AND CURRENT ROW instead of relying on the default frame. The default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW—a cumulative average, not a rolling one. Drill the window-functions practice page and the moving-average practice page until the explicit frame clause is muscle memory.
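The contrast is easy to demonstrate in SQLite, which also defaults to a cumulative frame once `ORDER BY` appears in the window: on the last of 8 days, the default frame still averages all 8 rows, while the explicit frame averages only the trailing 7.

```python
import sqlite3

# Demonstrate default (cumulative) vs explicit (rolling) frames in SQLite.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sessions (shop_id INT, day TEXT, sessions INT)")
conn.executemany(
    "INSERT INTO daily_sessions VALUES (1, ?, ?)",
    [(f"2026-04-{d:02d}", s)
     for d, s in zip(range(8, 16), [50, 60, 55, 70, 65, 80, 75, 90])],
)

row = conn.execute("""
    SELECT
        AVG(sessions) OVER (ORDER BY day) AS cumulative_avg,   -- default frame
        ROUND(AVG(sessions) OVER (
            ORDER BY day
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW           -- explicit frame
        ), 2) AS rolling_avg
    FROM daily_sessions
    ORDER BY day DESC
    LIMIT 1
""").fetchone()
print(row)  # (68.125, 70.71): cumulative keeps day 1, rolling drops it
```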
Merchant-analytics framing on every prompt
Frame your verbal walkthrough in merchant-analytics vocabulary: shops, sessions, days, months, activations, attribution, UTM sources. Interviewers test whether you can translate the abstract SQL primitive to the business question—matching the interviewer's mental model of what their team builds. The Shopify easy practice set leans on these framings; mirror that vocabulary in your answer.
Regex in SQL is a real, expected primitive
556 UTM Source Extraction is the medium-tier proof that regex is part of the SQL toolkit at Shopify. Memorize the Postgres form SUBSTRING(url FROM 'utm_source=([^&]+)') and the Snowflake form REGEXP_SUBSTR(...). Always pair the extraction with a COALESCE(..., 'direct') for the no-UTM case. Drill the regular-expressions practice page until you can write the pattern from memory.
Where to practice on PipeCode
Start with the Shopify practice page for the curated 5-problem set. After that, drill the matching topic pages: aggregation, date-functions, window-functions, moving-average, regular-expressions. The interview courses page bundles the SQL course if you want a structured curriculum before company drills. For broad coverage, browse by topic.
Communication and approach under time pressure
For every prompt, narrate the structure before writing code: "I'll truncate session_ts to month, COUNT, GROUP BY, ORDER BY ascending." Interviewers grade process as much as the final answer. Leave 5 minutes at the end of each problem for an edge-case sweep: empty input, NULL created_at, never-activated shops, missing utm_source, single-row partitions in window functions. The most common "almost passed" failure mode is correct happy-path code that crashes on the empty input—a 30-second sweep prevents it.
Frequently Asked Questions
What is the Shopify data engineering interview process like?
Shopify's data engineering interview opens with a recruiter screen, then a technical phone screen with one SQL coding problem, then an onsite (or virtual onsite) of four to five rounds: two to three SQL coding rounds (DATE_TRUNC rollups, JOIN + first-touch, rolling-window aggregates, regex extraction), one data-modeling discussion (sessions, shops, attribution dimensional models), and one to two behavioral rounds. Shopify does not test Python in the data engineering loop in this curated set; every coding round is SQL on merchant-session analytics.
What SQL topics does Shopify test for data engineers?
Shopify SQL interviews concentrate on five primitives that correspond directly to the curated 5-problem set: DATE_TRUNC for monthly and daily rollups (#553, daily count), MIN(timestamp) GROUP BY entity plus JOIN plus date subtraction for activation latency (#554), AVG(...) OVER (PARTITION BY ... ORDER BY ... ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) for rolling averages (#555 — HARD), and regex SUBSTRING(url FROM 'pattern') for query-string extraction (#556 — MEDIUM). Memorize these five patterns and you will recognize most Shopify prompts within the first 60 seconds.
Does Shopify test Python in data engineering interviews?
The curated Shopify practice set is 5 SQL, 0 Python. Shopify's data engineering interview is SQL-only in the curated loop. If a Python question appears, treat it as a different role family (data scientist, analytics engineer with Python responsibilities) rather than the canonical data-engineering loop covered in this guide. Spending prep time on Python algorithms for a Shopify data-engineering interview is a misallocation; SQL fluency is the entire signal.
How hard are Shopify data engineering interview questions?
The Shopify set is 3 easy + 1 medium + 1 hard. The Hard-tier problem (#555 7-Day Rolling Average Sessions Per Shop) is the make-or-break round; without explicit ROWS BETWEEN 6 PRECEDING AND CURRENT ROW, candidates produce a cumulative average instead. The MEDIUM problem (#556 Extract UTM Source From Landing URL) is regex pattern-recognition once you have seen utm_source=([^&]+). The three EASY problems are DATE_TRUNC + COUNT rollups and a JOIN-with-date-difference; interviewers expect zero hesitation and clean tie-break handling.
Are window functions and rolling averages common in Shopify interviews?
Yes—#555 7-Day Rolling Average Sessions Per Shop is explicit. If you cannot fluently write AVG(sessions) OVER (PARTITION BY shop_id ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW), you cannot pass the Shopify HARD round. Drill the window-functions practice page and the moving-average practice page until the frame clause is muscle memory.
How many Shopify practice problems should I solve before the interview?
Solve all 5 problems on the Shopify practice page end-to-end—untimed first, then timed at 25 minutes per problem. After that, broaden to 30 to 40 additional SQL problems spread across the matching topic pages: aggregation, date-functions, joins, window-functions, moving-average, regular-expressions. The Shopify SQL practice page and the Shopify easy practice page are the right surfaces for the final week of prep.


![Diagram with a landing URL containing utm_source query parameters, a regex pattern utm_source=([^&]+) highlighting the capture group, and an output cell showing the extracted source value.](https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdogvc6opnrt5p6mblmsm.jpeg)