Snowflake data engineering interview questions split into two distinct loops that share a name. The Snowflake-the-company SWE / DE interview is LeetCode-style Python: array iteration with set-based validation logic (the classic SET card-game rule, len(set(values)) in {1, 3} per attribute) and hash-table sliding-window counters over strings (Counter(substrings) driven by a rolling character-frequency dict). The Snowflake-as-tool data-engineering interview — for any DE role at any company that runs on Snowflake — is product-knowledge plus SQL: three-layer architecture, micro-partitions, clustering keys, Time Travel and Fail-safe, Zero-Copy Cloning, Snowpipe ingestion, and the LAG/LEAD/AVG ... OVER (PARTITION BY ...) window-function primitives that drive consecutive-streak detection, day-over-day deltas, and monthly-aggregate analytics.
This guide walks four topic clusters end-to-end, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches a curated 2-problem set (1 EASY Python array + 1 MEDIUM Python hash-table-sliding-window) plus two adjacent primitives — SQL window functions and Snowflake architecture — that every Snowflake-flavored interview rotates through. Whether the candidate is interviewing AT Snowflake or for a data-engineering role at a company that builds on Snowflake, the four primitives below cover the bar.
Top Snowflake data engineering interview topics
From the Snowflake data engineering practice set, the four numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–4) | Why it shows up at Snowflake |
|---|---|---|
| 1 | Python arrays and set validation for the SET card game | SET Card Game Validation (EASY) — zip(*cards) per-attribute iteration plus len(set(vals)) in {1, 3}, the all-same-or-all-different invariant that powers any "validate a row across N records" pipeline. |
| 2 | Python hash tables and sliding window for maximum substring occurrences | Maximum Substring Occurrences (MEDIUM) — rolling character-frequency dict over a length-k window plus a Counter keyed on the substring itself, the canonical pattern for "count distinct windows under a constraint." |
| 3 | SQL window functions (LAG, LEAD, AVG OVER) for Snowflake analytics | Marketing-touch streak detection — LAG(DATE_TRUNC('week', event_date)) OVER (PARTITION BY contact_id ORDER BY ...) plus CTE composition, the SQL primitive every Snowflake-flavored interview rotates through. |
| 4 | Snowflake architecture: micro-partitions, clustering, and Time Travel | Three-layer architecture (Storage / Query Processing / Cloud Services), micro-partition pruning via min/max metadata, clustering keys, scale-up vs scale-out, Time Travel + Fail-safe + Zero-Copy Cloning — the product-knowledge primitive every Snowflake-as-tool interviewer probes. |
Dual-audience framing rule: Snowflake interview prompts split into two categories. If the candidate is interviewing AT Snowflake the company, sections 1–2 (Python algorithms) carry the loop. If the candidate is interviewing for a data-engineering role at a Snowflake-using company, sections 3–4 (SQL window functions + Snowflake architecture) carry the loop. State which loop you're prepping for and rebalance the four primitives accordingly.
1. Python Arrays and Set Validation for the SET Card Game
Array iteration with set-based validation in Python for data engineering
"Given three SET cards (each a 4-tuple of attributes from {color, shape, number, shading}), validate that across all three cards every attribute is either all the same or all different" is Snowflake's signature EASY Python prompt (SET Card Game Validation). The mental model: zip(*cards) produces one tuple per attribute index across all three cards; set(values) deduplicates that tuple; the SET game rule is satisfied when every attribute's set has cardinality 1 (all same) or 3 (all different) — never 2. Same primitive powers any "validate a row across N records" pipeline — schema-conformance checks, all-same-or-all-different label validation in ML datasets, "every replica reports the same status" health monitors.
Pro tip: len(set(values)) collapses three separate equality checks into a single primitive. Avoid the temptation to write (a == b == c) or (a != b and b != c and a != c) — the set-cardinality test is shorter, faster, and idiomatic. State the cardinality invariant out loud before writing code.
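To see the two styles side by side, here is a minimal contrast sketch; the pairwise chain and the set-cardinality test agree on every input, but only one survives a whiteboard without typos:

```python
values = ('R', 'R', 'G')  # the invalid 2-distinct case

# pairwise version: three comparisons, easy to mistype
valid_pairwise = (values[0] == values[1] == values[2]) or (
    values[0] != values[1] and values[1] != values[2] and values[0] != values[2]
)

# set-cardinality version: one primitive
valid_set = len(set(values)) in {1, 3}

assert valid_pairwise == valid_set  # both False here
```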
Iterating tuples by attribute index: zip(*cards)
The transpose-via-zip invariant: zip(*iterables) interleaves elements at the same index across all iterables; zip(*cards) where cards is a list of 3 tuples produces 4 tuples, one per attribute, each with 3 values. This is the standard Python idiom for column-wise iteration over a row-major data structure.
- zip(*cards) — yields per-attribute tuples; * unpacks the outer list.
- list(zip(*cards)) — materialize into a list if you need to iterate twice.
- Equivalent loop — for i in range(len(cards[0])): values = tuple(c[i] for c in cards) — verbose but explicit.
- Same pattern for matrix transpose — zip(*matrix) is the one-line transpose.
Worked example. Three cards as 4-tuples; transpose to 4 attribute tuples.
| step | output |
|---|---|
| input | [('R', 'oval', 1, 'solid'), ('G', 'oval', 2, 'solid'), ('P', 'oval', 3, 'solid')] |
| zip(*cards) | [('R','G','P'), ('oval','oval','oval'), (1,2,3), ('solid','solid','solid')] |
Worked-example solution.
```python
cards = [('R', 'oval', 1, 'solid'),
         ('G', 'oval', 2, 'solid'),
         ('P', 'oval', 3, 'solid')]

attributes = list(zip(*cards))
# [('R','G','P'), ('oval','oval','oval'), (1,2,3), ('solid','solid','solid')]
```
Rule of thumb: zip(*rows) is the one-line Python transpose; reach for it whenever the question asks about "across all rows" properties.
All-same-or-all-different rule via len(set(values)) in {1, 3}
The cardinality invariant: set(values) deduplicates a sequence; len(set(values)) returns the number of distinct elements; for a 3-element input, that count is 1 (all same), 2 (mixed — invalid), or 3 (all different). The SET game rule accepts cardinality 1 or 3 and rejects 2.
- len(set(vals)) == 1 — all values equal.
- len(set(vals)) == 3 — all values pairwise distinct.
- len(set(vals)) == 2 — invalid mixed pattern (rejected by the rule).
- len(set(vals)) in {1, 3} — combined acceptance test in one expression.
Worked example. Three attribute tuples; one valid (all same), one valid (all different), one invalid (mixed).
| values | set(values) | len | rule |
|---|---|---|---|
| ('oval','oval','oval') | {'oval'} | 1 | all same ✓ |
| ('R','G','P') | {'R','G','P'} | 3 | all different ✓ |
| ('R','R','G') | {'R','G'} | 2 | invalid ✗ |
Worked-example solution.
```python
def attribute_ok(values):
    return len(set(values)) in {1, 3}

attribute_ok(('oval','oval','oval'))  # True
attribute_ok(('R','G','P'))           # True
attribute_ok(('R','R','G'))           # False
```
Rule of thumb: the cardinality test (len(set(...)) in {1, 3}) is the one-line replacement for any "all equal OR all distinct" check; never expand it into a conjunction of pairwise comparisons.
all() composition over per-attribute checks
The composition invariant: all(predicate(x) for x in iterable) short-circuits on the first False and returns True only if every element passes. Combined with zip(*cards), the entire SET validation collapses to one expression: all(len(set(v)) in {1, 3} for v in zip(*cards)).
- all(...) — returns True if every element is truthy; short-circuits on the first False.
- any(...) — mirror image; returns True if any element is truthy.
- Generator expression — (predicate(x) for x in iter) is lazy; pairs perfectly with all.
- Avoid for + flag — ok = True; for ...: if not check: ok = False; return ok is verbose.
Worked example. Compose attribute_ok over the four attribute tuples of a valid SET.
| attribute | values | ok |
|---|---|---|
| color | ('R','G','P') | True |
| shape | ('oval','oval','oval') | True |
| number | (1,2,3) | True |
| shading | ('solid','solid','solid') | True |
Worked-example solution.
```python
def is_valid_set(cards):
    return all(len(set(v)) in {1, 3} for v in zip(*cards))

cards = [('R','oval',1,'solid'),
         ('G','oval',2,'solid'),
         ('P','oval',3,'solid')]
is_valid_set(cards)  # True
```
Rule of thumb: the entire problem is a one-liner once you compose all + set + zip(*cards); if your code needs more than three lines, you're missing a primitive.
Common beginner mistakes
- Writing (a == b == c) or (a != b and b != c and a != c) instead of len(set(values)) in {1, 3} — slow, verbose, and easy to mistype.
- Using cards[0][i], cards[1][i], cards[2][i] indexing instead of zip(*cards) — works but signals you don't know the idiom.
- Treating cardinality 2 as valid — SET's whole rule is "no 2-distinct attributes allowed"; stuttering on this is graded as not understanding the prompt.
- Hardcoding 3 cards / 4 attributes instead of letting zip and all derive the dimensions — the same code should work for any N-card / M-attribute generalization (see the sketch after this list).
- Returning 1 or 0 instead of True / False — the contract is a boolean.
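Following up the hardcoding bullet above, a hedged sketch of the N-card generalization. The name is_valid_set_general is hypothetical, and it assumes the generalized rule is cardinality 1 or exactly the number of cards:

```python
def is_valid_set_general(cards):
    # "all same or all different" for any card count:
    # per-attribute cardinality must be 1 or len(cards)
    n = len(cards)
    return all(len(set(values)) in {1, n} for values in zip(*cards))

# classic 3-card game and a hypothetical 4-card / 2-attribute variant
is_valid_set_general([('R', 1), ('G', 2), ('P', 3)])            # True
is_valid_set_general([('R', 1), ('G', 2), ('P', 3), ('B', 4)])  # True
```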
Python Interview Question on SET Card Game Validation
Given three SET cards as 4-tuples of attributes from {color, shape, number, shading}, return True if every attribute is either all-same or all-different across the three cards, otherwise False.
```python
def is_valid_set(cards):
    pass  # your code here
```
Solution Using all, set, and zip(*cards)
```python
def is_valid_set(cards):
    return all(len(set(values)) in {1, 3} for values in zip(*cards))
```
Why this works: zip(*cards) transposes the 3×4 row-major card list into 4 attribute tuples of length 3; len(set(values)) counts distinct values per attribute and is 1 (all same) or 3 (all different) iff the SET rule is satisfied; len(...) in {1, 3} accepts both valid cardinalities and rejects the 2-distinct case; all(...) short-circuits to False on the first invalid attribute, returning True only when every attribute passes. The whole solution is one line, branch-free, and O(N · M) for N cards × M attributes.
Step-by-step trace for cards = [('R','oval',1,'solid'), ('G','oval',2,'solid'), ('P','oval',3,'solid')]:
- Transpose — zip(*cards) → [('R','G','P'), ('oval','oval','oval'), (1,2,3), ('solid','solid','solid')].
- Per-attribute set — {'R','G','P'} (len 3), {'oval'} (len 1), {1,2,3} (len 3), {'solid'} (len 1).
- Cardinality check — 3 ∈ {1, 3} ✓; 1 ∈ {1, 3} ✓; 3 ∈ {1, 3} ✓; 1 ∈ {1, 3} ✓.
- all — every check True → returns True.
Output:
| attribute | values | len(set) | in {1, 3} |
|---|---|---|---|
| color | R, G, P | 3 | ✓ |
| shape | oval, oval, oval | 1 | ✓ |
| number | 1, 2, 3 | 3 | ✓ |
| shading | solid, solid, solid | 1 | ✓ |
| result | True | | |
Why this works — concept by concept:
- zip(*cards) transpose — turns row-major card data into column-major attribute tuples; the standard Python one-liner for "iterate by attribute across rows."
- set(values) dedupe — collapses duplicates so len(...) returns the distinct count in O(N) hash-set inserts.
- len(...) in {1, 3} cardinality test — the SET game rule expressed as a single set-membership check; rejects the 2-distinct case implicitly.
- all(...) short-circuit — returns False on the first invalid attribute without scanning the rest; pairs naturally with the generator expression.
- One-line composition — the entire solution is a single expression, branch-free, with no temporary state or accumulator.
- O(N · M) time / O(N) space — N cards × M attributes; the generator builds one per-attribute set at a time, and each set holds at most N items.
Drill the Snowflake Python practice page for the curated array problem and the array practice page for breadth.
2. Python Hash Tables and Sliding Window for Maximum Substring Occurrences
Hash-table sliding-window for substring frequency in Python for data engineering
"Given a string s and integers k (substring length) and maxLetters (max distinct chars allowed), return the maximum number of times any length-k substring with at most maxLetters distinct characters appears in s" is Snowflake's signature MEDIUM Python prompt (Maximum Substring Occurrences). The mental model: a length-k window slides across s; a freq dict tracks per-character counts inside the window in O(1) per shift; a Counter tracks how many times each valid window-substring has been seen; the answer is the max value in that counter. Same primitive powers any "find the most frequent fixed-length pattern under a constraint" pipeline — find the most-repeated DNA k-mer with at most M distinct nucleotides, the most-repeated user-event sequence under a uniqueness cap.
Pro tip: Maintain the freq dict incrementally — increment for the entering character, decrement for the leaving character, and delete keys whose count hits 0 so len(freq) always equals the distinct character count in the window. Rebuilding freq from scratch each window is O(k) per shift and turns the algorithm O(n · k); the incremental update is O(1) per shift, leaving the O(k) substring slice for the answer key as the only per-window cost.
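A minimal contrast sketch of the two update strategies the pro tip compares; the helper names are illustrative, not part of the problem statement:

```python
def distinct_rebuild(s, i, k):
    # naive: rebuild the window's character set from scratch, O(k) per call
    return len(set(s[i:i + k]))

def slide(freq, out_c, in_c):
    # incremental: one decrement-with-cleanup plus one increment, O(1) per shift
    freq[out_c] -= 1
    if freq[out_c] == 0:
        del freq[out_c]  # zero-cleanup keeps len(freq) honest
    freq[in_c] = freq.get(in_c, 0) + 1
```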
Char-frequency dict for the window: increment / decrement per character
The window-state invariant: the freq dict at any time reflects the multiset of characters in the current length-k window; updating from window [i..i+k-1] to [i+1..i+k] requires one increment (for s[i+k]) and one decrement (for s[i]). Cleanup the zero-count key so len(freq) always equals the distinct character count.
- Increment — freq[c] = freq.get(c, 0) + 1.
- Decrement — freq[c] -= 1; then del freq[c] if freq[c] == 0 (the cleanup step).
- Distinct-char count — len(freq) (with cleanup) gives the number of distinct chars in the window.
- defaultdict(int) — alternative; same outcome, slightly cleaner increment, but you still need the zero-cleanup branch.
Worked example. Slide a length-3 window over 'abca'; show freq per position.
| window | freq | distinct |
|---|---|---|
| abc | {a:1, b:1, c:1} | 3 |
| bca | {b:1, c:1, a:1} | 3 |
Worked-example solution.
```python
s, k = 'abca', 3
freq = {}
for c in s[:k]:  # build the first window
    freq[c] = freq.get(c, 0) + 1
# freq == {'a': 1, 'b': 1, 'c': 1}; len(freq) == 3

# slide: out s[0] = 'a', in s[3] = 'a'
freq['a'] -= 1
if freq['a'] == 0:
    del freq['a']
freq['a'] = freq.get('a', 0) + 1
# freq == {'b': 1, 'c': 1, 'a': 1}
```
Rule of thumb: always pair the decrement with the zero-cleanup; otherwise len(freq) overcounts distinct characters and the constraint check breaks.
Distinct-char count via len(freq) and zero-cleanup
The distinct-count invariant: after every increment / decrement + cleanup, len(freq) equals the number of distinct characters currently in the window. This is the per-window check against maxLetters.
- Valid window — len(freq) <= maxLetters.
- Cleanup is non-negotiable — without del freq[c] when the count hits 0, a key with count 0 still inflates len(freq).
- No set() rebuild — len(set(window)) is O(k) per check; the incremental dict gives O(1).
- Snapshot only on demand — record the substring in the answer counter only when the validity check passes.
Worked example. Validate windows of 'abcabcab' with maxLetters = 2.
| window | distinct | valid? |
|---|---|---|
| abc | 3 | no |
| bca | 3 | no |
| cab | 3 | no |
Worked-example solution.
```python
s, k, maxLetters = 'abcabcab', 3, 2
# every length-3 window has 3 distinct chars → no valid window → answer = 0
```
Rule of thumb: the distinct-char check is one comparison (len(freq) <= maxLetters); never recompute distinct count from scratch.
Substring counter for the answer: Counter or dict.get(s, 0) + 1 over candidates
The answer-aggregation invariant: a separate Counter (or vanilla dict) keyed on the substring tracks how many times each valid window-substring has been seen; the final answer is the maximum value in that counter. The constraint filter happens before the counter increment, so only valid substrings ever land in the answer.
- from collections import Counter — answer[sub] += 1 then needs no missing-key check.
- Vanilla dict — answer[sub] = answer.get(sub, 0) + 1; same outcome, no import.
- Final answer — max(answer.values(), default=0); the default=0 handles the empty-counter case.
- Substring slice — s[i:i+k] is O(k) to create; one slice per valid window is the cost.
Worked example. Count valid windows for 'aababcaab' with k=3, maxLetters=2.
| window | distinct | counter |
|---|---|---|
| aab | 2 | {aab: 1} |
| aba | 2 | {aab: 1, aba: 1} |
| bab | 2 | {aab: 1, aba: 1, bab: 1} |
| abc | 3 | (skipped) |
| bca | 3 | (skipped) |
| caa | 2 | {aab: 1, aba: 1, bab: 1, caa: 1} |
| aab | 2 | {aab: 2, aba: 1, bab: 1, caa: 1} |
Worked-example solution.
```python
from collections import Counter

s, k, maxLetters = 'aababcaab', 3, 2
freq, answer = {}, Counter()
# ... incremental loop ...
# answer == {'aab': 2, 'aba': 1, 'bab': 1, 'caa': 1}
# max(answer.values()) == 2
```
Rule of thumb: keep two state structures — one for the rolling window (freq), one for the answer (Counter); never conflate them.
Common beginner mistakes
- Rebuilding freq from s[i:i+k] each iteration — O(n · k) instead of O(n).
- Forgetting the zero-cleanup — len(freq) overcounts distinct chars and the validity check rejects valid windows.
- Counting every window (not just valid ones) — pollutes the answer counter with substrings that violate maxLetters.
- Branching with if sub in answer: answer[sub] += 1 / else: answer[sub] = 1 — works but reads worse than Counter or dict.get(sub, 0) + 1.
- Returning the first valid substring's count instead of max(answer.values()) — misreads the prompt.
Python Interview Question on Maximum Substring Occurrences
Given a string s and integers k and maxLetters, return the maximum number of occurrences of any length-k substring of s whose distinct-character count is at most maxLetters. If no such substring exists, return 0.
```python
def maxFreq(s: str, k: int, maxLetters: int) -> int:
    pass  # your code here
```
Solution Using rolling freq dict + Counter over valid substrings
```python
from collections import Counter

def maxFreq(s: str, k: int, maxLetters: int) -> int:
    if k > len(s):
        return 0
    freq, answer = {}, Counter()
    # initialize the first window
    for c in s[:k]:
        freq[c] = freq.get(c, 0) + 1
    if len(freq) <= maxLetters:
        answer[s[:k]] += 1
    # slide
    for i in range(1, len(s) - k + 1):
        out_c, in_c = s[i - 1], s[i + k - 1]
        freq[out_c] -= 1
        if freq[out_c] == 0:
            del freq[out_c]
        freq[in_c] = freq.get(in_c, 0) + 1
        if len(freq) <= maxLetters:
            answer[s[i:i + k]] += 1
    return max(answer.values(), default=0)
```
Why this works: the freq dict is updated incrementally — one increment (entering char) and one decrement-plus-cleanup (leaving char) per shift, total O(1) per window — so the rolling distinct-character count len(freq) is always correct in constant time; len(freq) <= maxLetters filters out invalid windows before they touch the answer counter; valid window substrings are tallied in a Counter; the final max(answer.values(), default=0) returns the most frequent valid substring's count, defaulting to 0 if nothing qualified.
Step-by-step trace for s = 'aababcaab', k = 3, maxLetters = 2:
| i | window | freq (after slide) | distinct | valid? | answer |
|---|---|---|---|---|---|
| 0 | aab | {a:2, b:1} | 2 | ✓ | {aab: 1} |
| 1 | aba | {a:2, b:1} | 2 | ✓ | {aab: 1, aba: 1} |
| 2 | bab | {a:1, b:2} | 2 | ✓ | {aab: 1, aba: 1, bab: 1} |
| 3 | abc | {a:1, b:1, c:1} | 3 | ✗ | (unchanged) |
| 4 | bca | {a:1, b:1, c:1} | 3 | ✗ | (unchanged) |
| 5 | caa | {a:2, c:1} | 2 | ✓ | {aab: 1, aba: 1, bab: 1, caa: 1} |
| 6 | aab | {a:2, b:1} | 2 | ✓ | {aab: 2, aba: 1, bab: 1, caa: 1} |
Final: max(answer.values()) == 2.
Output:
| substring | occurrences |
|---|---|
| aab | 2 |
| aba | 1 |
| bab | 1 |
| caa | 1 |
| answer | 2 |
Why this works — concept by concept:
- Rolling freq dict — the per-window character-frequency map is updated in O(1) per shift via one increment and one decrement-plus-cleanup; the scan itself is O(n).
- Zero-cleanup via del freq[c] when freq[c] == 0 — keeps len(freq) equal to the distinct character count; without it the constraint check would reject valid windows.
- len(freq) <= maxLetters validity filter — single comparison; only valid windows enter the answer counter.
- Counter over substrings — tallies how many times each valid substring has been seen; answer[sub] += 1 needs no missing-key branch.
- max(answer.values(), default=0) — final aggregation; the default=0 handles the case where no window qualified.
- O(n · k) time / O(n · k) space — the loop is O(n) shifts, each with an O(k) substring slice for the answer key; the answer counter holds at most n − k + 1 distinct length-k substrings.
More sliding-window problems and hash-table problems on PipeCode.
3. SQL Window Functions (LAG, LEAD, AVG OVER) for Snowflake Analytics
Window functions and CTE composition in SQL for Snowflake data engineering
"Find every contact who had a marketing touch in three or more consecutive weeks AND at least one trial_request" is the canonical Snowflake-flavored SQL prompt — the DataLemur-staple marketing-touch streak question that surfaces in nearly every Snowflake-as-tool DE interview. The mental model: DATE_TRUNC('week', event_date) snaps event timestamps to the week boundary; LAG(week_trunc) OVER (PARTITION BY contact_id ORDER BY week_trunc) reaches one row back inside the per-contact partition; the streak condition is lag_week = current_week - INTERVAL '1 week'. Same primitive powers any "consecutive-period" or "day-over-day" analytic — consecutive-month subscription renewals, hour-over-hour traffic deltas, week-over-week feature-adoption ramps.
Pro tip: Always ground your window function with both PARTITION BY and ORDER BY. Forgetting PARTITION BY contact_id makes LAG reach into the previous contact's events and produces meaningless deltas; forgetting ORDER BY week_trunc leaves the row order non-deterministic and the answer flaky. State both clauses out loud before writing the SELECT.
Window basics: PARTITION BY group + ORDER BY ordering
The window-function invariant: OVER (PARTITION BY <group> ORDER BY <ordering>) declares an independent ordered subset for every value of the group expression; the window function (LAG, LEAD, AVG, ROW_NUMBER, …) runs inside that subset only. PARTITION BY is the equivalent of GROUP BY for windows; ORDER BY fixes the row order so offset functions (LAG/LEAD) and ranking functions (ROW_NUMBER/RANK) are deterministic.
- PARTITION BY is optional — omitted, the whole table is one window; useful for global running totals.
- ORDER BY is required for offset / ranking functions — LAG, LEAD, ROW_NUMBER, RANK all need an explicit order.
- AVG(...) OVER (...) — group-aggregate-on-row; returns the average across the partition for each row.
- Frames — ROWS BETWEEN ... AND ... further bounds the visible rows; the default frame is fine for LAG (see the sketch at the end of this subsection).
Worked example. Two events for contact_id 1; LAG(week_trunc) reaches one row back inside the partition.
| event_id | contact_id | week_trunc | LAG(week_trunc) |
|---|---|---|---|
| 1 | 1 | 2022-04-11 | NULL |
| 2 | 1 | 2022-04-18 | 2022-04-11 |
Worked-example solution.
```sql
SELECT event_id, contact_id, week_trunc,
       LAG(week_trunc) OVER (PARTITION BY contact_id ORDER BY week_trunc) AS lag_week
FROM events_with_week;
```
Rule of thumb: always set PARTITION BY to the entity that owns the time series (contact_id, user_id, account_id); without it, day-1 of one entity leaks into day-N of another.
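To make the frames bullet from the list above concrete, here is a hedged sketch assuming a hypothetical daily_traffic(day, visits) table: an explicit ROWS frame turns AVG ... OVER into a 7-day rolling average.

```sql
SELECT day,
       visits,
       AVG(visits) OVER (
           ORDER BY day
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW  -- current day + 6 prior
       ) AS visits_7d_avg
FROM daily_traffic;
```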
LAG vs LEAD vs AVG OVER: row-N-before vs row-N-after vs group-aggregate-on-row
The window-function-family invariant: LAG(expr, n) returns expr from n rows before; LEAD(expr, n) from n rows after; AVG(expr) OVER (...) returns the average of expr across the entire partition (not just the rows up to current). They share the OVER syntax but answer different questions.
- LAG(week_trunc) — previous row's week_trunc.
- LEAD(week_trunc) — next row's week_trunc.
- AVG(stars) OVER (PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)) — monthly average per product (the DataLemur Q2 pattern; see the sketch at the end of this subsection).
- ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) — sequential rank inside each partition.
Worked example. Three rows for contact_id 1; both LAG and LEAD annotated.
| event_id | week_trunc | LAG | LEAD |
|---|---|---|---|
| 1 | 2022-04-11 | NULL | 2022-04-18 |
| 2 | 2022-04-18 | 2022-04-11 | 2022-04-25 |
| 3 | 2022-04-25 | 2022-04-18 | NULL |
Worked-example solution.
```sql
SELECT event_id, week_trunc,
       LAG(week_trunc)  OVER (PARTITION BY contact_id ORDER BY week_trunc) AS lag_week,
       LEAD(week_trunc) OVER (PARTITION BY contact_id ORDER BY week_trunc) AS lead_week
FROM events_with_week
WHERE contact_id = 1;
```
Rule of thumb: "delta vs prior" → LAG; "delta vs next" or "next-event distance" → LEAD; "row's value compared to group average" → AVG ... OVER.
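And the AVG ... OVER bullet made concrete: a hedged sketch assuming a hypothetical reviews(product_id, submit_date, stars) table in the DataLemur Q2 style.

```sql
SELECT product_id,
       EXTRACT(MONTH FROM submit_date) AS mth,
       stars,
       AVG(stars) OVER (
           PARTITION BY product_id, EXTRACT(MONTH FROM submit_date)
       ) AS avg_stars_this_month  -- each row sees its product-month average
FROM reviews;
```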
CTE composition: WITH consecutive_events AS (...) for multi-step logic
The CTE composition invariant: a WITH ... AS (...) Common Table Expression names an intermediate result that subsequent SELECT statements can reference; multi-step window logic is far more readable as a CTE pipeline than as a deeply nested subquery. Snowflake supports CTEs natively, including recursive CTEs for hierarchical traversal.
- WITH cte_name AS (SELECT ...) — defines the named intermediate.
- Multiple CTEs — WITH a AS (...), b AS (...) SELECT ... FROM b JOIN a ... — chain steps.
- Recursive CTE — WITH RECURSIVE cte AS (base UNION ALL recursive); rare in interviews but Snowflake-supported (see the sketch after this list).
- Materialization — Snowflake may inline or materialize CTEs; behavior depends on the query optimizer.
Worked example. Compute lag_week and lead_week once in a CTE, then filter for the streak condition.
| step | what | output |
|---|---|---|
| 1 | CTE | rows with current/lag/lead weeks |
| 2 | filter | rows where lag = current - 1 week OR lead = current + 1 week |
| 3 | join + filter | contacts with trial_request AND streak |
Worked-example solution.
```sql
WITH consecutive_events AS (
    SELECT event_id, contact_id, event_type,
           DATE_TRUNC('week', event_date) AS current_week,
           LAG(DATE_TRUNC('week', event_date))
               OVER (PARTITION BY contact_id ORDER BY DATE_TRUNC('week', event_date)) AS lag_week,
           LEAD(DATE_TRUNC('week', event_date))
               OVER (PARTITION BY contact_id ORDER BY DATE_TRUNC('week', event_date)) AS lead_week
    FROM marketing_touches
)
SELECT DISTINCT c.email
FROM consecutive_events e
INNER JOIN crm_contacts c ON e.contact_id = c.contact_id
WHERE (e.lag_week = e.current_week - INTERVAL '1 week'
       OR e.lead_week = e.current_week + INTERVAL '1 week')
  AND e.contact_id IN (
      SELECT contact_id FROM marketing_touches WHERE event_type = 'trial_request'
  );
```
Rule of thumb: use a CTE the moment your window logic needs more than one step; deeply nested subqueries with embedded OVER (...) clauses are unreadable.
Common beginner mistakes
- Forgetting PARTITION BY contact_id — LAG reaches into the previous contact's events and produces meaningless deltas.
- Forgetting ORDER BY week_trunc — row order is non-deterministic; LAG returns whichever row the planner happened to pick.
- Using WHERE to filter aggregates — WHERE COUNT(*) > 1 is a parse error; use HAVING or wrap in a subquery / CTE (see the sketch after this list).
- Counting calendar weeks via (event_date - lag_event_date) / 7 instead of DATE_TRUNC('week', ...) — fails on edge cases like daylight-saving-time crossings.
- Hardcoding the offset (LAG(volume, 1)) when the prompt says "vs 7 days ago" — read the spec and use LAG(volume, 7).
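The WHERE-vs-HAVING bullet as a two-statement contrast, sketched against the marketing_touches table from this section:

```sql
-- parse error: aggregates cannot appear in WHERE
-- SELECT contact_id FROM marketing_touches GROUP BY contact_id WHERE COUNT(*) > 1;

-- correct: filter aggregates with HAVING
SELECT contact_id
FROM marketing_touches
GROUP BY contact_id
HAVING COUNT(*) > 1;
```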
SQL Interview Question on Marketing-Touch Streak Detection
Given marketing_touches(event_id, contact_id, event_type, event_date) and crm_contacts(contact_id, email), return the email of every contact who had at least three consecutive-week marketing touches AND at least one event_type = 'trial_request'.
Solution Using LAG / LEAD window functions inside a CTE
```sql
WITH consecutive_events_cte AS (
    SELECT event_id, contact_id, event_type,
           DATE_TRUNC('week', event_date) AS current_week,
           LAG(DATE_TRUNC('week', event_date))
               OVER (PARTITION BY contact_id ORDER BY DATE_TRUNC('week', event_date)) AS lag_week,
           LEAD(DATE_TRUNC('week', event_date))
               OVER (PARTITION BY contact_id ORDER BY DATE_TRUNC('week', event_date)) AS lead_week
    FROM marketing_touches
)
SELECT DISTINCT contacts.email
FROM consecutive_events_cte AS events
INNER JOIN crm_contacts AS contacts
    ON events.contact_id = contacts.contact_id
WHERE (events.lag_week = events.current_week - INTERVAL '1 week'
       OR events.lead_week = events.current_week + INTERVAL '1 week')
  AND events.contact_id IN (
      SELECT contact_id FROM marketing_touches WHERE event_type = 'trial_request'
  );
```
Why this works: the CTE truncates each event to its week boundary and uses LAG/LEAD to surface the adjacent weeks inside each contact's partition; a row passes the streak filter when its lag or lead week is exactly one week away, and every row of a three-consecutive-week chain qualifies (the middle row matches on both sides), so qualifying contacts always surface; the IN subquery enforces the "at least one trial_request" requirement; the INNER JOIN resolves emails; SELECT DISTINCT collapses duplicates from contacts with multiple qualifying weeks.
Step-by-step trace for the DataLemur sample (contact_id 1 across April 17, 23, 30 + May 14):
| event_id | contact_id | week_trunc | lag_week | lead_week | streak? |
|---|---|---|---|---|---|
| 1 | 1 | 2022-04-11 | NULL | 2022-04-18 | ✓ (lead) |
| 2 | 1 | 2022-04-18 | 2022-04-11 | 2022-04-25 | ✓ (both) |
| 3 | 1 | 2022-04-25 | 2022-04-18 | NULL | ✓ (lag) |
- CTE materializes — three rows per contact_id with current/lag/lead weeks.
- Streak filter — every row of contact_id 1 passes (lag or lead = current ± 7 days).
- trial_request filter — contact_id 1 has event_id 2 = trial_request → passes.
- Inner join + DISTINCT — emits andy.markus@att.net once.
Output:
| email |
|---|
| andy.markus@att.net |
Why this works — concept by concept:
- DATE_TRUNC('week', event_date) — snaps timestamps to the week boundary so consecutive-week comparisons are exact; works in Snowflake, Postgres, and BigQuery alike.
- LAG / LEAD window — surfaces adjacent rows inside the per-contact partition without a self-join; O(n log n) with one window sort instead of O(n²) with a self-join.
- CTE composition — keeps the window logic separate from the streak filter; makes the query readable and the optimizer plan stable.
- OR between lag and lead — accepts a row whose neighbor on either side is one week away; every row of a three-consecutive-week chain qualifies, including the middle row, which matches on both sides.
- IN subquery for trial_request — enforces the second condition without a separate JOIN; the optimizer typically inlines this as a semi-join.
- O(|events| · log |events|) time — the window sort dominates; the hash join to crm_contacts is linear in its inputs.
More SQL window-function problems and CTE problems on PipeCode.
4. Snowflake Architecture: Micro-Partitions, Clustering, and Time Travel
Snowflake's three-layer architecture and product-knowledge primitives
"Design Snowflake clustering, virtual warehouse, and Time Travel retention for a 10TB events table queried frequently by date and region" is the canonical Snowflake architecture interview prompt. The mental model: Snowflake has three independent layers — Database Storage (immutable micro-partitions in S3 / Azure Blob / GCS), Query Processing (Virtual Warehouses, scale-up for compute, scale-out for concurrency), and Cloud Services (auth, metadata, optimizer, security). Performance comes from micro-partition pruning (the optimizer skips partitions whose min/max metadata excludes the WHERE predicate) and clustering keys that co-locate rows on chosen columns. Recovery comes from Time Travel (1–90 days of point-in-time queries) plus Fail-safe (7 additional days of Snowflake-managed recovery). Cloning comes from Zero-Copy Cloning (instant snapshot via metadata pointers, no storage duplication).
Pro tip: Snowflake is OLAP, not OLTP — designed for analytical workloads, not row-level UPDATE/DELETE traffic. It is also NOT an ETL tool — pair it with Airflow / dbt / Matillion for orchestration. Stating both framings unprompted in the interview signals product fluency and cuts off the most common follow-up traps.
Three-layer architecture: storage / query processing / cloud services
The architecture invariant: Snowflake separates storage, compute, and services so each scales independently — store petabytes while compute is off, spin up massive compute for a 10-minute job, and let the services layer manage metadata and optimization. Traditional warehouses (Teradata, Netezza) tightly couple storage and compute, forcing over-provisioning.
- Database Storage — columnar micro-partitions in cloud storage; immutable; you don't manage files.
- Query Processing — Virtual Warehouses; clusters of CPU/RAM/SSD; XS, S, M, L, XL, 2XL, 3XL, 4XL, 5XL, 6XL sizes.
- Cloud Services — auth, RBAC, metadata, optimizer, security; the "brain"; serverless from your perspective.
- Cross-cloud — runs on AWS, Azure, GCP; cross-cloud replication for disaster recovery.
Worked example. Sketch the layer responsibilities for a typical events table workload.
| layer | responsibility | scaling lever |
|---|---|---|
| Storage | persist micro-partitions; immutable | grow with data volume |
| Query Processing | execute SQL on Virtual Warehouses | scale up (size) or out (clusters) |
| Cloud Services | auth, metadata, optimizer | serverless |
Worked-example solution.
```sql
-- size a warehouse for the workload
CREATE WAREHOUSE events_wh
    WAREHOUSE_SIZE = 'LARGE'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4;
```
Rule of thumb: size the warehouse to the heaviest query, set AUTO_SUSPEND = 60 for cost control, and use MIN/MAX_CLUSTER_COUNT for concurrency rather than oversizing a single cluster.
Micro-partitions and pruning via min/max metadata
The pruning invariant: Snowflake stores per-column min/max (and other) metadata for every micro-partition; on a WHERE predicate, the optimizer reads metadata and skips partitions whose value range cannot satisfy the predicate. This is "automatic indexing" — no CREATE INDEX required.
- Micro-partition — 50–500 MB compressed, columnar, immutable.
- Min/max metadata — per-column, per-partition; enables WHERE date >= '2023-01-01' to skip partitions whose max date < 2023-01-01.
- Clustering — ensures min/max metadata is selective by co-locating rows on the chosen columns.
- Search Optimization Service — adds point-lookup acceleration on top of pruning for high-cardinality predicates (see the sketch after this list).
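For the Search Optimization Service bullet, a hedged one-statement sketch. The ON EQUALITY clause scopes the acceleration to named columns; availability depends on the Snowflake edition:

```sql
-- accelerate point lookups on a high-cardinality column that clustering can't serve
ALTER TABLE events ADD SEARCH OPTIMIZATION ON EQUALITY(customer_id);
```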
Worked example. Pruning a query against a 1000-partition events table.
| predicate | partitions scanned | pruned |
|---|---|---|
| WHERE date = '2023-06-15' | ~3 (that day's partitions, one per clustered region) | 997 |
| WHERE region = 'US-EAST' | ~250 (with clustering on region) | 750 |
| WHERE customer_id = 12345 | 1000 (no clustering on customer_id) | 0 |
Worked-example solution.
```sql
-- enable clustering for predictable pruning
ALTER TABLE events
    CLUSTER BY (DATE_TRUNC('day', event_date), region);
```
Rule of thumb: cluster on the columns you filter / aggregate by most often (typically date + a low-cardinality dimension); never cluster on a high-cardinality random column like user_id.
Clustering keys + scale-up vs scale-out + Time Travel + Zero-Copy Cloning
The composition invariant: Snowflake's full performance + recovery story stacks four primitives — clustering keys for storage layout, warehouse sizing for query speed (scale up) and concurrency (scale out), Time Travel for point-in-time recovery, and Zero-Copy Cloning for instant dev/QA environments. Each is independent; combine them per workload requirements.
- CLUSTER BY (col1, col2) — co-locates rows on the chosen columns; benefits pruning.
- Scale up — WAREHOUSE_SIZE = 'XLARGE'; doubles compute per cluster; speeds up one big query.
- Scale out — MAX_CLUSTER_COUNT = 4; spins up additional clusters for concurrent queries.
- Time Travel — DATA_RETENTION_TIME_IN_DAYS = 7; query past data with AT (TIMESTAMP => '...').
- Zero-Copy Cloning — CREATE TABLE events_dev CLONE events; — instant, no storage duplication.
Worked example. Warehouse + clustering + Time Travel + clone for the events table.
| concern | setting |
|---|---|
| query speed | WAREHOUSE_SIZE = 'LARGE' |
| concurrency | MAX_CLUSTER_COUNT = 4 |
| pruning | CLUSTER BY (DATE_TRUNC('day', event_date), region) |
| recovery window | DATA_RETENTION_TIME_IN_DAYS = 7 |
| dev environment | CREATE TABLE events_dev CLONE events; |
Worked-example solution.
```sql
ALTER TABLE events CLUSTER BY (DATE_TRUNC('day', event_date), region);
ALTER TABLE events SET DATA_RETENTION_TIME_IN_DAYS = 7;

CREATE TABLE events_dev CLONE events;

-- query a past state
SELECT *
FROM events
AT (TIMESTAMP => '2026-04-30 09:00:00'::timestamp);
```
Rule of thumb: set Time Travel retention to match your incident-response window (typically 7 days for production, 1 day for staging); the higher the retention, the higher the storage cost.
Common beginner mistakes
- Calling Snowflake an ETL tool — it is a data warehouse; pair it with Airflow / dbt / Matillion / Airbyte.
- Calling Snowflake OLTP — singleton UPDATE/DELETE traffic is poorly served; Snowflake is OLAP-first.
- Asking "how do I create an index?" — Snowflake doesn't use traditional indexes; min/max metadata on micro-partitions handles pruning automatically.
- Suggesting VACUUM / ANALYZE — Snowflake is fully managed; no manual maintenance.
- Treating Time Travel as a backup substitute — Time Travel is for accidental drops within a window; Fail-safe is the 7-day Snowflake-managed safety net beyond Time Travel; for long-term backups use replicated databases or unloaded files.
SQL Interview Question on Snowflake Architecture Design
Design the Snowflake configuration for a 10TB events table that is queried daily by date and region, must support 4-day point-in-time recovery, and needs an isolated dev environment for testing schema changes. Specify the warehouse size, clustering keys, Time Travel retention, and clone strategy.
Solution Using clustering + multi-cluster warehouse + 7-day Time Travel + Zero-Copy Clone
```sql
-- 1. Warehouse: scale up (LARGE) for query speed; scale out (max 4 clusters) for concurrency
CREATE OR REPLACE WAREHOUSE events_wh
    WAREHOUSE_SIZE = 'LARGE'
    AUTO_SUSPEND = 60
    AUTO_RESUME = TRUE
    MIN_CLUSTER_COUNT = 1
    MAX_CLUSTER_COUNT = 4;

-- 2. Clustering + Time Travel: co-locate rows by query predicates (date + region)
ALTER TABLE events CLUSTER BY (DATE_TRUNC('day', event_date), region);
ALTER TABLE events SET DATA_RETENTION_TIME_IN_DAYS = 7;

-- 3. Dev environment: instant zero-copy clone
CREATE OR REPLACE TABLE events_dev CLONE events;

-- 4. Verify pruning effectiveness on a sample query
SELECT *
FROM events
WHERE event_date = '2026-04-30'
  AND region = 'US-EAST'
LIMIT 10;
```
Why this works: the LARGE warehouse fits a 10TB scan into memory + local SSD without spilling; auto-suspend at 60s + auto-resume eliminates idle compute cost; clustering on (DATE_TRUNC('day', event_date), region) aligns the natural query predicates with the partition layout so pruning scans only ~3 partitions for a single-day single-region query (down from ~1000); 7-day Time Travel covers a 4-day recovery requirement with margin; Zero-Copy Cloning produces events_dev instantly, with metadata pointers — the dev table only consumes storage as it diverges from production.
Step-by-step trace of the design rationale:
- Workload analysis — 10TB table; daily queries by date + region; 4-day recovery requirement; dev environment needed.
- Compute layer — LARGE warehouse for the heaviest query; multi-cluster (1–4) for concurrency without oversize.
- Storage layer — cluster on (date, region) so min/max metadata aligns with predicates → pruning skips ~99.7% of partitions on point queries.
- Recovery layer — 7-day Time Travel covers 4-day requirement with safety margin; Fail-safe gives an additional 7 days of Snowflake-managed recovery.
- Dev layer — Zero-Copy Clone produces instant dev table; storage diverges only on writes.
Output:
| layer | setting | rationale |
|---|---|---|
| Compute | WAREHOUSE_SIZE = 'LARGE', MAX_CLUSTER_COUNT = 4 | scale up + scale out |
| Storage | CLUSTER BY (DATE_TRUNC('day', event_date), region) | maximize pruning on query predicates |
| Recovery | DATA_RETENTION_TIME_IN_DAYS = 7 | covers 4-day requirement with margin |
| Dev | CREATE TABLE events_dev CLONE events; | instant zero-copy environment |
| Cost | AUTO_SUSPEND = 60 | idle compute paid only in seconds |
Why this works — concept by concept:
- Three-layer separation — independent scaling of storage / compute / services; the compute warehouse can be 'LARGE' today and 'XSMALL' tomorrow without touching data.
- Micro-partition pruning — min/max metadata on each partition plus clustering on predicate columns lets the optimizer skip ~99.7% of partitions on a single-day single-region query.
- Multi-cluster warehouse — MAX_CLUSTER_COUNT = 4 handles concurrency; instead of queueing 100 simultaneous queries on one cluster, additional clusters spin up.
- 7-day Time Travel — AT (TIMESTAMP => ...) queries any past state within retention; covers accidental drops, schema mistakes, and the stated 4-day recovery requirement.
- Zero-Copy Cloning — CREATE TABLE ... CLONE ... is instant; storage is metadata-only until divergence; perfect for dev/QA from production.
- Cost via auto-suspend — AUTO_SUSPEND = 60 charges only for compute-active seconds; storage and services are billed separately and remain on while compute sleeps.
- Pruning ratio — clustering on (date, region) aligns the natural query predicates with the partition layout, so a single-day single-region query scans ~3 of ~1000 partitions (≈ 0.3%); the remaining 99.7% are pruned via min/max metadata before any data is read.
More dimensional-modeling problems and the Snowflake practice page for the full curated set.
Tips to crack Snowflake data engineering interviews
Two audiences — pick yours and rebalance the four primitives
The phrase "Snowflake interview" splits cleanly. Audience A — interviewing AT Snowflake the company — the loop is LeetCode-style Python: the curated 2 problems (#24 SET Card Game Validation, #161 Maximum Substring Occurrences) hint at the bar, and the algorithm fluency in §1 and §2 carries the round. Audience B — interviewing for a data-engineering role at a company that uses Snowflake — the loop is SQL window functions plus product knowledge, and §3 plus §4 carry the round. Pick yours upfront and rebalance prep time accordingly.
Drill the four primitives
The four primitives in this guide map directly to the curated 2 PipeCode Python problems plus the two adjacent primitives every Snowflake-flavored interview rotates through: Python zip(*cards) + set + all for SET-style validation (Python EASY — #24), Python rolling freq dict + Counter over substrings for windowed counting (Python MEDIUM — #161), SQL LAG/LEAD/AVG OVER PARTITION BY plus CTE composition for streak detection and analytical aggregations (SQL — DataLemur staple), and Snowflake architecture (3 layers, micro-partitions + clustering + Time Travel + Zero-Copy Cloning). Each maps to a specific module: vanilla dict and collections.Counter for the Python primitives, OVER (PARTITION BY ... ORDER BY ...) for windows, and the Snowflake DDL surface (CLUSTER BY, DATA_RETENTION_TIME_IN_DAYS, CLONE, WAREHOUSE_SIZE) for architecture.
Memorize the OLAP-not-OLTP and NOT-an-ETL-tool framing
The single most common Snowflake interview trap is treating Snowflake as a traditional RDBMS or as an ETL tool. State unprompted: "Snowflake is OLAP — analytical workloads, not row-level UPDATE/DELETE traffic; it pairs with Airflow / dbt / Matillion for orchestration; it doesn't use traditional indexes — micro-partition min/max metadata + clustering does the pruning automatically." This single framing flip cuts off the most common follow-up traps before they fire.
LeetCode-style fluency for Snowflake-the-company
The 2 PipeCode problems hint at the actual Snowflake-the-company DE / SWE interview bar: algorithm fluency in Python, not Snowflake product trivia. The interview tests zip(*iter) + set composition and rolling freq dict + Counter patterns, not "explain virtual warehouses." Drill the Python practice page and the easy practice page until the canonical EASY-tier code rolls off your fingers in under three minutes.
Snowpipe + Time Travel + Zero-Copy Cloning are the signature features — name them out loud
For Snowflake-as-tool interviews, the three signature product features that separate fluent candidates from bluffers are Snowpipe (serverless continuous ingestion via cloud-event notifications), Time Travel (1–90 day point-in-time recovery + 7-day Fail-safe beyond), and Zero-Copy Cloning (instant dev/QA snapshots via metadata pointers). Name them unprompted when discussing ingestion, recovery, or dev environments — interviewers grade product fluency and these three are the highest-signal features.
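Since Snowpipe is named here but never shown above, a minimal hedged sketch. The stage name and file format are assumptions, and AUTO_INGEST relies on cloud-event notifications being wired to the stage's bucket:

```sql
-- serverless continuous ingestion: new files landing in the stage load automatically
CREATE PIPE events_pipe
    AUTO_INGEST = TRUE
AS
    COPY INTO events
    FROM @events_stage
    FILE_FORMAT = (TYPE = 'JSON');
```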
Easy-Medium discipline matters
The curated Snowflake set on PipeCode is 1 EASY + 1 MEDIUM Python. EASY at Snowflake doesn't mean trivial — it means the interviewer expects zero hesitation, idiomatic Python, and an articulated invariant. A correct EASY answer with stuttering on zip(*cards) or len(set(...)) in {1, 3} is graded worse than a correct MEDIUM answer with the same flaw. Drill the Easy practice page and the Medium practice page until the canonical EASY-tier code rolls off your fingers in under three minutes.
Where to practice on PipeCode
Start with the Snowflake practice page and the language-scoped Snowflake Python practice page for the curated 2-problem set. After that, drill the matching topic pages: array, hash table, sliding window, string, window functions, CTE, aggregation, dimensional modeling, JSON. The interview courses page bundles structured curricula. For broader coverage, browse by topic, or pivot to peer interview guides — the Airbnb DE interview guide and the top DE interview questions 2026 blog.
Communication and approach under time pressure
Talk through the invariant first ("this is a sliding-window plus hash-table problem"), the brute force second ("a triple-nested loop over all length-k substrings would also work"), and the optimal third ("but the rolling freq dict gives O(1) per shift instead of O(k)"). Interviewers grade process as much as the final answer. Leave 5 minutes for an edge-case sweep: empty input, k > len(s), single-character strings, maxLetters = 0, cards with 0 attributes, NULL in a window-function partition. The most common "almost passed" failure mode is correct happy-path code that crashes on edge cases — a 30-second sweep prevents it.
Frequently Asked Questions
What is the Snowflake data engineering interview process like?
The Snowflake data engineering interview opens with a 30-minute recruiter screen, then a 60-minute technical phone screen with a live SQL or Python coding problem, then a 4-round virtual onsite: one or two coding rounds (Python algorithms or SQL window functions, depending on whether the role is at Snowflake the company or at a Snowflake-using company), one system-design round (designing data pipelines, warehouse architecture, or trade reconciliation), one data-modeling discussion (star vs snowflake schema, dimensional modeling), and a behavioral round. End-to-end the loop runs three to four weeks.
What programming languages does Snowflake test in data engineering interviews?
Snowflake data engineering interviews are bilingual — SQL and Python in roughly equal measure across the loop. Python concentrates on array iteration with set-based validation, hash-table and sliding-window patterns, and dict-counter idioms. SQL concentrates on window functions (LAG, LEAD, AVG OVER PARTITION BY), CTE composition, aggregation, and Snowflake's specific dialect features (DATE_TRUNC, DATEADD, QUALIFY, VARIANT / JSON queries). Snowpark (Python/Java/Scala on Snowflake compute) appears at senior backend-leaning DE roles but is rarely tested in the live coding rounds.
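QUALIFY is the one dialect feature in that list worth a concrete line. A hedged sketch against the marketing_touches table from section 3, returning each contact's latest event without a wrapping subquery:

```sql
SELECT *
FROM marketing_touches
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY contact_id
    ORDER BY event_date DESC
) = 1;  -- keep only the newest row per contact
```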
How difficult are Snowflake data engineering interview questions?
The curated Snowflake practice set on PipeCode is 1 easy + 1 medium, no hard. The EASY is a Python array + set validation problem (SET Card Game Validation); the MEDIUM is a Python hash-table + sliding-window problem (Maximum Substring Occurrences). At the onsite, system-design and architecture questions can reach L4-L5 level — designing the warehouse + clustering + Time Travel for a 10TB events table, choosing scale-up vs scale-out — but the live coding rounds stay in the EASY-MEDIUM zone. Stuttering on the EASY is a stronger negative signal than struggling with the MEDIUM.
How should I prepare for a Snowflake data engineer interview?
Solve the 2 problems on the Snowflake practice page end-to-end — untimed first, then timed at 25 minutes per problem — and broaden to 30 to 50 additional problems across the matching topic pages: array, hash table, sliding window, string on the Python side, and window functions, CTE, aggregation, dimensional modeling on the SQL side. Read the Snowflake architecture chapter of the official documentation for a week (three layers, micro-partitions, virtual warehouses, Time Travel, Snowpipe, Zero-Copy Cloning). Practice articulating idempotency and clustering choices when discussing warehouse design — those framings are graded heavily at any Snowflake-flavored interview, and they're the difference between strong "Snowflake interview questions and answers" and bluff-level definitions.
What Snowflake-specific topics show up most often in interviews?
Five Snowflake-specific topics dominate every interview list — Snowflake architecture (the three-layer split between Database Storage, Query Processing via Virtual Warehouses, and Cloud Services); micro-partitions and pruning (50–500MB columnar blocks with min/max metadata that the optimizer uses to skip irrelevant partitions); Snowflake window functions (LAG, LEAD, AVG ... OVER (PARTITION BY ...) plus QUALIFY for window-result filtering); Time Travel + Fail-safe (1–90 day point-in-time recovery plus 7 days of Snowflake-managed safety net); and Zero-Copy Cloning (instant snapshots for dev/QA via metadata pointers). Master these five and you can answer any Snowflake-product question on any interview list.
Is Snowflake an ETL tool, and how does that affect the interview?
No — Snowflake is not an ETL tool. It is a cloud data warehouse designed for storing and analyzing data; ETL/ELT orchestration runs in tools like Apache Airflow, dbt, Matillion, Coalesce.io, or Airbyte that load data into Snowflake. This distinction shows up in interview "Snowflake ETL interview questions" — the correct answer is to clarify the architecture (Snowflake as the destination warehouse, an external orchestrator for transformations) and walk through the typical loading patterns (COPY INTO for bulk historical loads, Snowpipe for serverless continuous ingestion, External Tables for in-place queries against cloud storage). Stating "Snowflake is not an ETL tool" unprompted is one of the highest-signal product-fluency moves a candidate can make.