Lyft data engineering interview questions lean Python-fundamentals first with a focused SQL analytics finish: four Python primitives (heap-backed multi-stream merge for ordered event readers, prefix hash buckets for text autocomplete, hash-table plus two-pointer array intersection without set(), and binary search on the arr[i] - (i + 1) invariant for Nth-missing-integer counting), plus two SQL primitives that test ride-share analytics fluency (first-touch time-series aggregation for new-customers-per-day reports and LAG-driven window functions for consecutive-day buyer cohorts). The framings are mobility data—rider event streams, place autocomplete, route-segment intersection, sparse trip-id arrays, daily new-rider counts, and two-day return retention.
This guide walks through the six topic clusters Lyft actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 6-problem Lyft set (3 easy, 3 medium, no hard)—a fundamentals-heavy hub that rewards Python data-structure fluency, the binary-search invariant trick, and clean ride-share SQL over algorithm-puzzle prep.
Top Lyft data engineering interview topics
From the Lyft data engineering practice set, the six numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–6) | Why it shows up at Lyft |
|---|---|---|
| 1 | Queue design and multi-stream merge in Python | Multi Stream Reader—heapq min-heap k-way merge over K sorted iterators. |
| 2 | Hash tables for text search autocomplete in Python | Text Search Auto-Complete—defaultdict(list) prefix bucket lookup for low-latency place suggestions. |
| 3 | Hash table plus two pointers for array intersection in Python | Array Intersection Without Set—Counter-style frequency dict with multi-set semantics, no set() allowed. |
| 4 | Binary search for the Nth missing integer in Python | Nth Missing Integer—arr[i] - (i + 1) invariant turns Nth-missing-positive into a clean monotone predicate. |
| 5 | SQL time-series aggregation for new customers per day | New Customers Per Day—MIN(order_date) GROUP BY user_id first-touch CTE, then COUNT per first-touch date. |
| 6 | SQL window functions for consecutive-day buyer cohorts | Customers Who Bought on 2 Consecutive Days—LAG(purchase_date) over a user-partitioned ordered window. |
Mobility-data framing rule: Lyft's prompts span ride-share analytics—rider events streamed in time order, place-name autocomplete, route-segment intersection, trip-id sparse arrays, daily new riders, two-day return retention. The interviewer is grading whether you map each business framing to the right primitive: ordered multi-stream → min-heap; prefix lookup → hash bucket; intersection without `set()` → frequency dict + two pointers; Nth missing → binary search on the count-of-missing invariant; new-customers-per-day → first-touch + daily count; consecutive-day buyers → `LAG` partitioned by user. State the mapping out loud.
1. Queue Design and Multi-Stream Merge in Python
Heap-backed queues for multi-stream merge in Python for data engineering
"Read K sorted event streams and emit them as one merged sorted stream" is the canonical queue-design interview prompt at Lyft. The mental model: a min-heap holds the current smallest unconsumed element from each stream; pop the global minimum, write it out, then push the next element from the same stream. This is the standard k-way merge—the same pattern that powers external sort, streaming analytics over partitioned topics, and any system that joins ordered event feeds (rider GPS pings, driver location updates, trip-state transitions) on event-time.
Pro tip: Python's `heapq` compares tuples lexicographically, keeping the smallest at the root. Push `(value, stream_id, item)` so ties on `value` resolve deterministically by `stream_id`—never push `(value, item)` where `item` is non-comparable (a dict, an object without `__lt__`), or your heap will raise `TypeError` mid-merge on tied values.
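A minimal sketch of the tie-safe push, assuming hypothetical event-dict payloads (the timestamps and `lat` fields here are illustrative, not from the original prompt):

```python
import heapq

heap = []
events = [({"lat": 37.77}, 0), ({"lat": 37.78}, 1)]  # hypothetical payloads

# Both events share the same timestamp; the stream_id in position two breaks
# the tie, so the dict payloads in position three are never compared.
for payload, stream_id in events:
    heapq.heappush(heap, (1700000000, stream_id, payload))

ts, sid, payload = heapq.heappop(heap)
print(sid)  # 0 — tie resolved by stream_id, no TypeError
```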
heapq as a min-heap priority queue
heapq is a heap-ordered list with O(log n) push and pop. The invariant: the smallest element is always at index 0. There is no Heap object—the module provides functions that operate on a regular list, treating it as a binary heap rooted at index 0.
- `heapq.heappush(h, x)` — O(log n) insert; restores heap order after appending.
- `heapq.heappop(h)` — O(log n) removal of the smallest; moves the last element to the root and sifts it down.
- `heapq.heapify(lst)` — O(n) in-place heapify of an existing list.
- Min-heap only — for a max-heap, push the negation (`-x`) or wrap values in a comparator class (sketched below).
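A minimal sketch of the negation trick, assuming numeric priorities (non-numeric keys would need a wrapper class that inverts `__lt__`):

```python
import heapq

nums = [5, 2, 8, 1, 3]

# Simulate a max-heap by storing negated values in a min-heap.
max_heap = [-x for x in nums]
heapq.heapify(max_heap)

largest = -heapq.heappop(max_heap)
print(largest)  # 8
```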
Worked example. Push the values 5, 2, 8, 1, 3 onto an empty heap one by one and trace the heap after each insertion.
| step | operation | heap state | min? |
|---|---|---|---|
| 1 | push 5 | [5] | 5 |
| 2 | push 2 | [2, 5] | 2 |
| 3 | push 8 | [2, 5, 8] | 2 |
| 4 | push 1 | [1, 2, 8, 5] | 1 |
| 5 | push 3 | [1, 2, 8, 5, 3] | 1 |
Worked-example solution.
```python
import heapq

heap = []
for x in [5, 2, 8, 1, 3]:
    heapq.heappush(heap, x)
    print(heap, "min =", heap[0])

# pop them in sorted order
sorted_out = []
while heap:
    sorted_out.append(heapq.heappop(heap))
print(sorted_out)  # [1, 2, 3, 5, 8]
```
Rule of thumb: if the prompt mentions "smallest of many," "k-way merge," or "top-K," reach for heapq before you reach for sorted().
k-way merge: pull cheapest head, push next from same stream
Given K sorted iterators, the merge invariant is always emit the global minimum head; never advance more than one stream per step. Seed the heap with the first element from each stream tagged with its stream_id. Pop the min (value, stream_id) and write value to the output. Then pull the next element from streams[stream_id] and push it onto the heap. Repeat until the heap is empty.
- Tag with `stream_id` so on a tie the heap can break it deterministically and so the popped value tells you which stream to advance.
- One pop, one push keeps the heap size at exactly K (one head per active stream).
- Output order is globally sorted because at every step the heap contains every active stream's smallest unconsumed value, so `heap[0]` is the global minimum.
- Total cost — `n` pops + `n` pushes over a heap of size at most K → `O(n log K)` for `n` total elements across K streams.
Worked example. Three sorted streams: s0 = [1, 4, 7], s1 = [2, 5, 8], s2 = [3, 6, 9]. Trace the heap and the output for the first four steps.
| step | heap before pop | popped | next pushed | output so far |
|---|---|---|---|---|
| 1 | [(1,s0), (2,s1), (3,s2)] | (1,s0) | (4,s0) | [1] |
| 2 | [(2,s1), (4,s0), (3,s2)] | (2,s1) | (5,s1) | [1, 2] |
| 3 | [(3,s2), (4,s0), (5,s1)] | (3,s2) | (6,s2) | [1, 2, 3] |
| 4 | [(4,s0), (5,s1), (6,s2)] | (4,s0) | (7,s0) | [1, 2, 3, 4] |
Worked-example solution.
```python
import heapq

def k_way_merge(streams):
    heap = []
    iters = [iter(s) for s in streams]
    for i, it in enumerate(iters):
        try:
            heapq.heappush(heap, (next(it), i))
        except StopIteration:
            pass
    while heap:
        val, i = heapq.heappop(heap)
        yield val
        try:
            heapq.heappush(heap, (next(iters[i]), i))
        except StopIteration:
            pass

list(k_way_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```
Rule of thumb: never carry the entire materialized list into the heap; only one head per stream lives in the heap at a time.
Stream exhaustion: pop without re-pushing when iterator drains
A stream is exhausted when its iterator raises StopIteration on next(it). The invariant: on exhaustion, pop without pushing back. The heap then shrinks by one and the merge continues over the remaining K-1 streams.
- Use `try / except StopIteration` around `next(it)` to detect end of stream.
- Heap size monotonically shrinks once any stream exhausts—do not refill from a different stream.
- Tail-merge naturally degenerates — when only one stream remains, the heap holds exactly one element each iteration; the merge keeps emitting in `O(log 1) = O(1)` per element.
Worked example. Two streams: s0 = [1, 5] (exhausts first), s1 = [2, 3, 4, 6]. After s0 is drained, the heap holds only s1's heads.
| step | heap | popped | s0 next | s1 next | output |
|---|---|---|---|---|---|
| 1 | [(1,s0), (2,s1)] | (1,s0) | 5 | — | [1] |
| 2 | [(2,s1), (5,s0)] | (2,s1) | — | 3 | [1, 2] |
| 3 | [(3,s1), (5,s0)] | (3,s1) | — | 4 | [1, 2, 3] |
| 4 | [(4,s1), (5,s0)] | (4,s1) | — | 6 | [1, 2, 3, 4] |
| 5 | [(5,s0), (6,s1)] | (5,s0) | StopIteration | — | [1, 2, 3, 4, 5] |
| 6 | [(6,s1)] | (6,s1) | — | StopIteration | [1, 2, 3, 4, 5, 6] |
Worked-example solution.
```python
import heapq

def merge_two(s0, s1):
    iters = {0: iter(s0), 1: iter(s1)}
    heap = []
    for k, it in iters.items():
        try:
            heapq.heappush(heap, (next(it), k))
        except StopIteration:
            pass
    while heap:
        v, k = heapq.heappop(heap)
        yield v
        try:
            heapq.heappush(heap, (next(iters[k]), k))
        except StopIteration:
            pass  # this stream is done — do not push back

list(merge_two([1, 5], [2, 3, 4, 6]))
# [1, 2, 3, 4, 5, 6]
```
Rule of thumb: the merge ends exactly when the heap is empty—not when all iterators raise StopIteration once. The heap drains on its own.
Common beginner mistakes
- Pushing the whole stream into the heap up front—`O(N)` memory instead of `O(K)`.
- Forgetting the `stream_id` tag and pushing only `value`—then on ties you can't tell which stream to advance.
- Pushing a non-comparable payload (dict, custom object without `__lt__`)—`heapq` raises `TypeError` mid-merge on tied values.
- Re-pushing from a freshly exhausted stream every iteration—silent infinite loop or `StopIteration` thrown to the caller.
- Treating `heap[0]` as a stable pointer across pushes—it changes; use `heappop` to consume.
Python Interview Question on Multi-Stream Reader
Given K sorted iterables of integers, return a single iterator that yields all values in globally sorted order. Each input iterable is internally sorted, may be of different length, and may be empty. The output must be lazy (one value at a time) so the caller can stream large inputs without materializing them.
Solution Using heapq k-way merge
```python
import heapq

def multi_stream_reader(streams):
    """Lazy k-way merge of K sorted iterables into one sorted stream."""
    iters = [iter(s) for s in streams]
    heap = []
    for i, it in enumerate(iters):
        try:
            heapq.heappush(heap, (next(it), i))
        except StopIteration:
            pass
    while heap:
        val, i = heapq.heappop(heap)
        yield val
        try:
            heapq.heappush(heap, (next(iters[i]), i))
        except StopIteration:
            pass
```
Step-by-step trace (input: streams = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]):
| stream_id | values |
|---|---|
| 0 | 1, 4, 7 |
| 1 | 2, 5, 8 |
| 2 | 3, 6, 9 |
- Seed the heap — pull the head from each stream: push `(1, 0)`, `(2, 1)`, `(3, 2)`. Heap now `[(1,0), (2,1), (3,2)]`.
- Pop `(1, 0)` — yield `1`. Pull next from stream 0 → `4`. Push `(4, 0)`. Heap `[(2,1), (4,0), (3,2)]`.
- Pop `(2, 1)` — yield `2`. Pull next from stream 1 → `5`. Push `(5, 1)`. Heap `[(3,2), (4,0), (5,1)]`.
- Pop `(3, 2)` — yield `3`. Pull next from stream 2 → `6`. Push `(6, 2)`. Heap `[(4,0), (5,1), (6,2)]`.
- Pop `(4, 0)` — yield `4`. Pull next from stream 0 → `7`. Push `(7, 0)`. Heap `[(5,1), (7,0), (6,2)]`.
- Pop `(5, 1)` — yield `5`. Pull next from stream 1 → `8`. Push `(8, 1)`. Heap `[(6,2), (7,0), (8,1)]`.
- Pop `(6, 2)` — yield `6`. Pull next from stream 2 → `9`. Push `(9, 2)`. Heap `[(7,0), (8,1), (9,2)]`.
- Final three pops drain the heap — yield `7`, then `8`, then `9`. After popping `(9, 2)`, pulling `next(iters[2])` raises `StopIteration`; the heap is empty and the generator returns.
Output:
| index | yielded |
|---|---|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |
Why this works — concept by concept:
- Min-heap invariant — `heap[0]` is always the global minimum of the active stream heads, so the very next yield is always the correct globally smallest unconsumed value.
- Stream-id tagging — pushing `(value, stream_id)` ensures ties break deterministically and tells the loop which stream's iterator to advance after each pop.
- One-head-per-stream rule — the heap holds at most K elements; we never materialize the full input, keeping memory at `O(K)` regardless of total size.
- Lazy generator — `yield` makes the merge producer-friendly: the caller pulls one element at a time, supporting unbounded or very large inputs.
- Exhaustion handling — `try/except StopIteration` around `next(iters[i])` shrinks the heap on drain; no special end-of-stream sentinel is needed.
- Cost — `n` pops + at most `n` pushes on a heap bounded by K → `O(n log K)` time, `O(K)` space, where `n` is total elements across all streams.
2. Hash Tables for Text Search Autocomplete in Python
Prefix-bucket hash tables for text search autocomplete in Python for data engineering
"Build an autocomplete that returns matching completions for a given prefix" is the canonical hash-table interview prompt for Lyft's place-search and rider-input flows. The mental model: at index time, for every term store it under all of its prefixes; at query time, the prefix is the dictionary key and the value is the list of completions. This trades index-time memory for query-time speed: every query becomes one O(1) dict lookup followed by a bounded slice. The trade-off is exactly what real-time UIs need—keystroke-by-keystroke suggestions can't afford a linear scan of millions of place names.
Pro tip: When the corpus is huge or storing every prefix blows memory, swap to a trie instead of a prefix-keyed dict. The dict approach stores L prefix keys per term of length L—O(L²) key characters in total; a trie shares prefix paths, so storage drops to O(unique nodes). For interview-tier prompts, the prefix-dict version is the expected baseline answer—mention the trie as a follow-up optimization.
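A minimal trie sketch for that follow-up, hedged: this variant caches completions at every node for O(len(prefix)) queries, so it shares key storage but not list storage; a leaner trie would store terms only at word ends and collect matches by DFS at query time. The class and function names are illustrative.

```python
class TrieNode:
    __slots__ = ("children", "terms")

    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.terms = []     # completions passing through this node

def build_trie(terms):
    root = TrieNode()
    for term in terms:
        node = root
        for ch in term:
            node = node.children.setdefault(ch, TrieNode())
            node.terms.append(term)  # cache the completion at each prefix node
    return root

def trie_query(root, prefix, k=5):
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return node.terms[:k]

root = build_trie(["tax", "taxi"])
print(trie_query(root, "ta"))  # ['tax', 'taxi']
```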
defaultdict(list) as a prefix → matches map
collections.defaultdict(list) is the canonical container for "many values per key." The invariant: a key access creates an empty list if missing, so you can append without writing if key not in d: d[key] = []. The structure: prefix -> list of full terms that start with that prefix.
- `from collections import defaultdict` — the import.
- `buckets = defaultdict(list)` — declare with the value factory `list`.
- `buckets[prefix].append(term)` — append-or-create in one line.
- No `KeyError` on read — `buckets["unseen"]` returns `[]` and inserts the empty list as a side effect; use `.get(prefix, [])` if you don't want the side effect.
Worked example. Index three terms—"airport", "airline", "alpha"—under their length-2 prefixes only.
| term | length-2 prefix | bucket after append |
|---|---|---|
| "airport" | "ai" | {"ai": ["airport"]} |
| "airline" | "ai" | {"ai": ["airport", "airline"]} |
| "alpha" | "al" | {"ai": ["airport", "airline"], "al": ["alpha"]} |
Worked-example solution.
```python
from collections import defaultdict

buckets = defaultdict(list)
for term in ["airport", "airline", "alpha"]:
    buckets[term[:2]].append(term)

print(dict(buckets))
# {'ai': ['airport', 'airline'], 'al': ['alpha']}
```
Rule of thumb: whenever the question is "given a key, give me a list of things," defaultdict(list) is the right primitive; reach for it before dict.setdefault.
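For contrast, a sketch of the same append with plain `dict.setdefault`; it works, but evaluates the `[]` default on every call even when the key already exists:

```python
buckets = {}
for term in ["airport", "airline", "alpha"]:
    # setdefault builds a fresh [] on every call, used only on first insert
    buckets.setdefault(term[:2], []).append(term)
print(buckets)  # {'ai': ['airport', 'airline'], 'al': ['alpha']}
```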
Build phase: index every prefix of every term
The autocomplete invariant: every prefix of every term must point to the term. For a term of length L, generate prefixes of length 1, 2, …, L and append the term under each. Build cost is O(sum of L²) across the corpus; query cost drops to O(1) for the lookup plus O(k) for slicing the top k results.
- Loop `for i in range(1, len(term) + 1)` — generate prefixes of every length from 1 up to the full term.
- `buckets[term[:i]].append(term)` — append the term under each prefix.
- Sort each bucket by frequency or rank if the prompt asks for "best matches first."
- Memory trade-off — each term appears in `L` buckets, so total bucket entries ≈ `sum(L_i)` ≈ `N * L_avg`; total key characters are `O(sum(L_i²))`.
Worked example. Index "taxi" and "tax" under all prefixes.
| term | prefixes | bucket appends |
|---|---|---|
| "tax" | "t", "ta", "tax" | {"t":["tax"], "ta":["tax"], "tax":["tax"]} |
| "taxi" | "t", "ta", "tax", "taxi" | {"t":["tax","taxi"], "ta":["tax","taxi"], "tax":["tax","taxi"], "taxi":["taxi"]} |
Worked-example solution.
```python
from collections import defaultdict

def build_index(terms):
    buckets = defaultdict(list)
    for term in terms:
        for i in range(1, len(term) + 1):
            buckets[term[:i]].append(term)
    return buckets

idx = build_index(["tax", "taxi"])
print(idx["ta"])    # ['tax', 'taxi']
print(idx["taxi"])  # ['taxi']
```
Rule of thumb: if the same term appears in multiple buckets (which it does, by design), accept the redundancy—query speed pays for it.
Query phase: O(1) lookup + bounded result trim
The query invariant: a prefix lookup is one dict access plus a slice. Return the first k items of buckets[prefix]—or all of them if the bucket is short. Adding a top-k limit keeps response size predictable for UI rendering.
- `buckets.get(prefix, [])` — safe lookup; returns `[]` for unseen prefixes.
- `results[:k]` — bound the response; use `heapq.nsmallest(k, ..., key=...)` if you need top-k by rank rather than insertion order (see the sketch after this list).
- Sort by frequency at index time once, not per query, if rank matters.
- Empty-prefix guard — return `[]` for `""`; never hand the caller "all terms."
Worked example. With the index from the previous example, query "ta" and "z" with k = 5.
| prefix | bucket | top-5 result |
|---|---|---|
| "ta" | ["tax", "taxi"] | ["tax", "taxi"] |
| "z" | (missing) | [] |
Worked-example solution.
```python
def autocomplete(idx, prefix, k=5):
    if not prefix:
        return []
    return idx.get(prefix, [])[:k]

print(autocomplete(idx, "ta", k=5))  # ['tax', 'taxi']
print(autocomplete(idx, "z", k=5))   # []
```
Rule of thumb: a one-liner with .get(prefix, [])[:k] covers the empty-bucket and bounded-output cases together.
Common beginner mistakes
- Using `dict[key].append(...)` without `defaultdict`—`KeyError` on the first append per key.
- Indexing only the full term (`buckets[term].append(term)`)—prefix queries return nothing.
- Storing the same term twice in a bucket because the input list has duplicates and you didn't dedupe.
- Slicing with `result[:k]` but forgetting to sort first when the prompt says "top-k by frequency."
- Returning the full corpus on an empty prefix—always guard `""`.
Python Interview Question on Text Search Autocomplete
Build a class Autocomplete that takes a list of unique strings at construction time and supports a method query(prefix, k=5) that returns the first k strings (insertion order is fine) that start with prefix. Empty prefix returns an empty list. Unknown prefix returns an empty list.
Solution Using defaultdict(list) prefix bucket
```python
from collections import defaultdict

class Autocomplete:
    def __init__(self, terms):
        self.buckets = defaultdict(list)
        for term in terms:
            for i in range(1, len(term) + 1):
                self.buckets[term[:i]].append(term)

    def query(self, prefix, k=5):
        if not prefix:
            return []
        return self.buckets.get(prefix, [])[:k]
```
Step-by-step trace (input: terms = ["airport", "airline", "alpha"], then query("ai", k=5), query("al", k=2), query("", k=5)):
| term | prefixes generated |
|---|---|
| "airport" | "a", "ai", "air", "airp", "airpo", "airpor", "airport" |
| "airline" | "a", "ai", "air", "airl", "airli", "airlin", "airline" |
| "alpha" | "a", "al", "alp", "alph", "alpha" |
- Build phase, term `"airport"` — for `i = 1..7`, append `"airport"` to buckets `"a"`, `"ai"`, `"air"`, `"airp"`, `"airpo"`, `"airpor"`, `"airport"`.
- Build phase, term `"airline"` — for `i = 1..7`, append `"airline"` to buckets `"a"`, `"ai"`, `"air"`, `"airl"`, `"airli"`, `"airlin"`, `"airline"`. Bucket `"a"` now `["airport", "airline"]`; bucket `"ai"` now `["airport", "airline"]`.
- Build phase, term `"alpha"` — for `i = 1..5`, append `"alpha"` to buckets `"a"`, `"al"`, `"alp"`, `"alph"`, `"alpha"`. Bucket `"a"` now `["airport", "airline", "alpha"]`.
- `query("ai", k=5)` — non-empty prefix → `buckets.get("ai", [])` returns `["airport", "airline"]`; slice `[:5]` returns `["airport", "airline"]`.
- `query("al", k=2)` — non-empty prefix → `buckets.get("al", [])` returns `["alpha"]`; slice `[:2]` returns `["alpha"]`.
- `query("", k=5)` — empty-prefix guard fires; return `[]` without touching `buckets`.
Output:
| query | result |
|---|---|
| query("ai", 5) | ["airport", "airline"] |
| query("al", 2) | ["alpha"] |
| query("", 5) | [] |
Why this works — concept by concept:
- `defaultdict(list)` — every prefix slot starts as `[]` automatically; we append-or-create in a single line without `KeyError` guards.
- All-prefix indexing — for each term of length `L`, we register it under every prefix of lengths 1..L, so any partial keystroke is one dict lookup away from the matching terms.
- Insertion-order list — Python lists preserve the order terms were appended, giving stable, deterministic results without a sort step.
- `.get(prefix, [])` — safe fallback; an unknown prefix returns the empty list rather than raising or creating a phantom bucket via `defaultdict` autovivification.
- Top-k slice — `[:k]` bounds the response cheaply; on small per-bucket lists it's effectively `O(k)`.
- Empty-prefix guard — short-circuits before any lookup so an empty input cannot accidentally return the entire corpus.
- Cost — build is `O(sum(L_i))` ≈ `O(N · L_avg)` time and space (each term appears in `L_i` buckets); query is `O(1)` lookup + `O(k)` slice.
3. Hash Table Plus Two Pointers for Array Intersection in Python
Frequency dicts and two pointers for array intersection without set() in Python
"Find the intersection of two arrays without using set()" is a deliberate Lyft constraint that forces candidates off the auto-pilot one-liner and into the underlying primitive. The mental model: intersection = elements that appear in both arrays, with multiplicity respected. Two clean approaches: a frequency dictionary built on the smaller array (then scanned against the larger), or a two-pointer walk if both arrays are already sorted. The frequency-dict version is O(n + m); the two-pointer version is O(n + m) after a sort, with O(1) extra space.
Pro tip: "Without
set()" usually means the interviewer wants to see you understand multi-set semantics. Ifa = [1, 1, 2]andb = [1, 2, 2], the answer is[1, 2], not[1, 1, 2, 2]—each match consumes one copy from each side. Decrement on match.
Frequency dict for the smaller array
The frequency-dict invariant: count occurrences in the smaller side, then walk the larger side and decrement. The smaller-side dict bounds memory; scanning the larger array is one pass. Use collections.Counter or a plain dict with .get(x, 0) + 1.
- `Counter(a)` — one-line frequency dict; values are integer counts.
- `if counts[x] > 0` before appending — prevents over-counting once the dict's tally drops to zero.
- `counts[x] -= 1` on each match — decrement so duplicates don't all match the same single occurrence.
- Smaller-side bias — count the shorter array to keep the dict small.
Worked example. a = [1, 2, 2, 3] (smaller), b = [2, 2, 4, 1] (larger). Build counts from a, then walk b.
| step | x | counts before | match? | counts after | result |
|---|---|---|---|---|---|
| 0 | — | {1:1, 2:2, 3:1} | — | — | [] |
| 1 | 2 | {1:1, 2:2, 3:1} | yes (2 > 0) | {1:1, 2:1, 3:1} | [2] |
| 2 | 2 | {1:1, 2:1, 3:1} | yes (1 > 0) | {1:1, 2:0, 3:1} | [2, 2] |
| 3 | 4 | {1:1, 2:0, 3:1} | no (get(4, 0) = 0) | unchanged | [2, 2] |
| 4 | 1 | {1:1, 2:0, 3:1} | yes (1 > 0) | {1:0, 2:0, 3:1} | [2, 2, 1] |
Worked-example solution.
```python
from collections import Counter

def intersect_freq(a, b):
    if len(a) > len(b):
        a, b = b, a  # build the dict on the smaller array, scan the larger
    counts = Counter(a)
    out = []
    for x in b:
        if counts.get(x, 0) > 0:
            out.append(x)
            counts[x] -= 1
    return out

intersect_freq([1, 2, 2, 3], [2, 2, 4, 1])  # [2, 2, 1]
```
Rule of thumb: Counter is the right primitive for multi-set intersection—set discards multiplicity by definition.
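Worth knowing as a cross-check: the stdlib `Counter` overloads `&` as multiset intersection (it keeps `min(count_a, count_b)` per key), so you can verify a hand-rolled version against it when the `set()` ban is lifted:

```python
from collections import Counter

a, b = [1, 2, 2, 3], [2, 2, 4, 1]
# Counter & Counter keeps min(count_a, count_b) per key — multiset semantics
print(sorted((Counter(a) & Counter(b)).elements()))  # [1, 2, 2]
```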
Two-pointer alternative when both arrays are sorted
When both inputs are already sorted (or you can afford to sort them), the two-pointer walk uses O(1) extra space. Maintain pointers i and j; at each step compare a[i] and b[j]. On equal, append and advance both. Otherwise, advance the side with the smaller value.
- Sort first if not already sorted — `O((n + m) log(n + m))`; otherwise `O(n + m)`.
- `if a[i] == b[j]` — match; append once, advance both.
- `elif a[i] < b[j]` — advance `i`.
- `else` — advance `j`.
- Multi-set semantics are automatic: advancing both on match consumes one copy from each side.
Worked example. a = [1, 2, 2, 3], b = [1, 2, 2, 4]—both sorted. Trace (i, j) after each step.
| step | i | j | a[i] | b[j] | action | result |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | match → i+=1, j+=1 | [1] |
| 1 | 1 | 1 | 2 | 2 | match → i+=1, j+=1 | [1, 2] |
| 2 | 2 | 2 | 2 | 2 | match → i+=1, j+=1 | [1, 2, 2] |
| 3 | 3 | 3 | 3 | 4 | a < b → i+=1 | [1, 2, 2] |
| 4 | 4 | 3 | — | — | i out of bounds → stop | [1, 2, 2] |
Worked-example solution.
```python
def intersect_two_pointers(a, b):
    a = sorted(a)  # Timsort is O(n) on already-sorted input; cheap to be safe
    b = sorted(b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

intersect_two_pointers([1, 2, 2, 3], [1, 2, 2, 4])  # [1, 2, 2]
```
Rule of thumb: if both inputs are guaranteed sorted, two-pointer beats the frequency dict on memory; if one side is much bigger and unsorted, the frequency-dict version is simpler.
Multi-set semantics: decrement on match to honor counts
The multi-set invariant: each element in a can match at most one element in b, and vice versa. Without decrementing, a = [1, 1] and b = [1, 1, 1] would emit [1, 1, 1]—wrong. Decrementing the dict (or advancing both pointers) caps the result at min(count_a, count_b) per value.
- Frequency dict — decrement `counts[x]` on each emit.
- Two-pointer — advance both `i` and `j` on match.
- `set` intersection — collapses both sides to distinct values; loses multiplicity.
- Test the boundary — if both arrays have wildly different counts of the same value, multi-set vs. set behavior differs visibly.
Worked example. a = [1, 1, 2], b = [1, 2, 2].
| approach | output |
|---|---|
| frequency dict (decrementing) | [1, 2] |
| two-pointer | [1, 2] |
| set(a) & set(b) (for contrast) | {1, 2} (no multiplicity) |
| naïve "if x in b" without decrement | [1, 1, 2] (wrong) |
Worked-example solution.
```python
from collections import Counter

def intersect_multiset(a, b):
    ca, cb = Counter(a), Counter(b)
    out = []
    for x, n in ca.items():
        if x in cb:
            out.extend([x] * min(n, cb[x]))
    return out

intersect_multiset([1, 1, 2], [1, 2, 2])  # [1, 2]
```
Rule of thumb: if the prompt says "intersection," default to multi-set semantics; ask the interviewer to confirm only if the test cases hint at set semantics.
Common beginner mistakes
- Using `set(a) & set(b)` and calling it done—loses multiplicity and ignores the "no `set()`" constraint.
- Forgetting to decrement on match—each `a` element matches every duplicate in `b`.
- Building the dict on the larger array—wastes the bias; keep the dict small and scan the larger array.
- Appending `x` for every appearance in `b` regardless of `counts[x]`—over-emits when `b` has more copies than `a`.
- Sorting copies in place with `a.sort()`—mutates the caller's input; use `sorted(a)` instead.
Python Interview Question on Array Intersection Without Set
Given two integer arrays a and b, return their intersection as a list, without using set or frozenset. Each element should appear in the result the minimum number of times it appears in either array. Order does not matter, but the multiplicity must be exact.
Solution Using a frequency dict and a single scan
```python
from collections import Counter

def array_intersection(a, b):
    if len(a) > len(b):
        a, b = b, a  # bias: build the dict on the smaller array
    counts = Counter(a)
    out = []
    for x in b:
        if counts.get(x, 0) > 0:
            out.append(x)
            counts[x] -= 1
    return out
```
Step-by-step trace (input: a = [1, 2, 2, 3, 4], b = [2, 2, 4, 4, 5]):
| index | b[index] | counts before | matches? | counts after | output |
|---|---|---|---|---|---|
| — | — | {1:1, 2:2, 3:1, 4:1} | — | — | [] |
| 0 | 2 | {1:1, 2:2, 3:1, 4:1} | yes | {1:1, 2:1, 3:1, 4:1} | [2] |
| 1 | 2 | {1:1, 2:1, 3:1, 4:1} | yes | {1:1, 2:0, 3:1, 4:1} | [2, 2] |
| 2 | 4 | {1:1, 2:0, 3:1, 4:1} | yes | {1:1, 2:0, 3:1, 4:0} | [2, 2, 4] |
| 3 | 4 | {1:1, 2:0, 3:1, 4:0} | no (counts[4] = 0) | unchanged | [2, 2, 4] |
| 4 | 5 | {1:1, 2:0, 3:1, 4:0} | no (get(5, 0) = 0) | unchanged | [2, 2, 4] |
- Bias step — `len(a) = 5`, `len(b) = 5`; arrays are equal length so no swap. Counter built from `a`: `{1:1, 2:2, 3:1, 4:1}`.
- `b[0] = 2` — `counts[2] = 2 > 0`, append `2`, decrement to `1`.
- `b[1] = 2` — `counts[2] = 1 > 0`, append `2`, decrement to `0`.
- `b[2] = 4` — `counts[4] = 1 > 0`, append `4`, decrement to `0`.
- `b[3] = 4` — `counts[4] = 0`, skip; the second `4` in `b` doesn't match because `a` only has one `4`.
- `b[4] = 5` — `counts.get(5, 0) = 0`, skip.
Output:
| index | value |
|---|---|
| 0 | 2 |
| 1 | 2 |
| 2 | 4 |
Why this works — concept by concept:
- Smaller-side dict — building the `Counter` on the shorter input bounds memory at `O(min(n, m))`; we still see every element of the larger side in one linear scan.
- `counts.get(x, 0)` — handles "value not in `a`" without `KeyError`; `0` correctly signals "no remaining copies to match."
- Decrement-on-match — capping the contribution per value at `counts[x]` enforces multi-set semantics: each match consumes one copy from each side.
- Single pass over `b` — exactly one walk over the larger array; no nested scans, no `O(n·m)` pitfalls.
- Order preservation — output reflects encounter order in `b`; cheap, deterministic, and avoids a sort step.
- No `set()` dependency — the constraint is honored via the dict; the algorithm doesn't rely on hash-set membership semantics.
- Cost — `O(n + m)` time, `O(min(n, m))` extra space.
4. Binary Search for the Nth Missing Integer in Python
Binary search on the count-of-missing invariant in Python for data engineering
"Given a strictly increasing array of positive integers, find the Nth missing positive integer relative to a contiguous [1, 2, 3, …] baseline" looks like a counting puzzle until you spot the invariant: at index i, the number of missing positives below arr[i] is exactly arr[i] - (i + 1) (assuming 0-based indexing). That count is monotonically non-decreasing as i advances—so binary search on the predicate missing(i) >= n finds the boundary in O(log L) time over the array length.
Pro tip: This problem trips candidates who reach for a linear scan first—`O(L)`—or expand the array—`O(arr[-1])`. State the invariant out loud, derive the formula, then announce the binary-search reduction. Interviewers value the explicit naming of the invariant more than they value the optimal complexity.
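For contrast, a sketch of that O(L) linear baseline; it is useful as a correctness oracle when testing the binary-search version below (the function name is illustrative):

```python
def nth_missing_linear(arr, n):
    # O(L) baseline: walk the array, counting the gap below each element.
    for i, v in enumerate(arr):
        if v - (i + 1) >= n:  # the nth missing value sits before arr[i]
            return n + i      # i array elements lie below the answer
    return n + len(arr)       # answer lies past the array end

assert nth_missing_linear([2, 3, 5, 9], 3) == 6
```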
Missing-count invariant: missing(i) = arr[i] - (i + 1)
The invariant is the heart of the problem. If the array were exactly [1, 2, 3, …], then arr[i] = i + 1, so arr[i] - (i + 1) = 0—zero missing values. Every gap in the array bumps arr[i] past i + 1, and the gap size accumulates: arr[i] - (i + 1) is the count of positive integers from 1 to arr[i] - 1 that are not in the array.
- Baseline — for the contiguous array `[1, 2, 3, ..., k]`, `arr[i] - (i + 1) = 0` everywhere.
- Single gap — drop `3` from `[1, 2, 3, 4, 5]`, leaving `[1, 2, 4, 5]`; at `i = 2`, `arr[2] = 4`, `(i + 1) = 3`, missing = `1`.
- Multiple gaps accumulate — at every index, the count is the total missing strictly below `arr[i]`.
- Strictly non-decreasing — because the array is strictly increasing, `missing(i)` never goes down as `i` advances.
Worked example. arr = [2, 3, 5, 9]. Compute missing(i) at every index.
| i | arr[i] | (i + 1) | missing(i) | what's missing below arr[i] |
|---|---|---|---|---|
| 0 | 2 | 1 | 1 | {1} |
| 1 | 3 | 2 | 1 | {1} (4 not yet seen) |
| 2 | 5 | 3 | 2 | {1, 4} |
| 3 | 9 | 4 | 5 | {1, 4, 6, 7, 8} |
Worked-example solution.
```python
def missing_at(arr, i):
    return arr[i] - (i + 1)

arr = [2, 3, 5, 9]
for i in range(len(arr)):
    print(f"i={i}, arr[i]={arr[i]}, missing={missing_at(arr, i)}")
# i=0, arr[i]=2, missing=1
# i=1, arr[i]=3, missing=1
# i=2, arr[i]=5, missing=2
# i=3, arr[i]=9, missing=5
```
Rule of thumb: when "Nth missing" appears with a sorted array, write arr[i] - (i + 1) on the whiteboard before writing any code.
Binary search on the predicate missing(mid) >= n
The predicate missing(i) >= n is monotone non-decreasing: once true, it stays true. That's the binary-search precondition. Search for the smallest index where the predicate first becomes true; the Nth missing value lives in the gap immediately before that index.
- Lower-bound search — `lo = 0`, `hi = len(arr)` (one past the last index).
- Loop condition — `while lo < hi:`.
- Mid — `mid = (lo + hi) // 2`.
- Branch — if `missing(mid) < n`, the boundary is to the right: `lo = mid + 1`. Else `hi = mid`.
- Termination — `lo == hi` is the smallest index with `missing(i) >= n`.
Worked example. arr = [2, 3, 5, 9], n = 3. Trace the search.
| step | lo | hi | mid | arr[mid] | missing(mid) | branch | new lo, hi |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 4 | 2 | 5 | 2 | 2 < 3 → right | lo=3, hi=4 |
| 1 | 3 | 4 | 3 | 9 | 5 | 5 >= 3 → left | lo=3, hi=3 |
| 2 | 3 | 3 | — | — | — | exit | lo = 3 |
The first index where missing(i) >= 3 is i = 3.
Worked-example solution.
```python
def find_boundary(arr, n):
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] - (mid + 1) < n:
            lo = mid + 1
        else:
            hi = mid
    return lo

find_boundary([2, 3, 5, 9], 3)  # 3
```
Rule of thumb: lower-bound binary search on a monotone predicate is the workhorse pattern—reach for it whenever "find the first index where … is true."
Recover the answer: n + lo after the loop
Once the loop returns the boundary index lo, the Nth missing value falls in the gap just before arr[lo] (or after the array end if lo == len(arr)). The closed-form recovery is answer = n + lo: the n accounts for "Nth missing" and lo accounts for the count of elements still in the array below the answer.
- Why `n + lo` — there are `lo` array elements at indices `0..lo-1`; their values are at least `1, 2, …, lo`. The Nth missing positive must come after them, offset by the missing count.
- Edge: `lo == 0` — no array elements below the answer; the answer is `n` directly.
- Edge: `lo == len(arr)` — the answer is past the array end; `n + len(arr)` is correct because every array element is "below" the answer and consumes a slot.
- Strict-increasing assumption — the formula breaks on duplicate or non-sorted inputs.
Worked example. Continuing the search above, lo = 3, n = 3, so answer = 3 + 3 = 6. Verify: missing values are {1, 4, 6, 7, 8}, sorted; the third is 6. ✓
| n | lo | n + lo | the missing values listed |
|---|---|---|---|
| 1 | 0 | 1 | {1, 4, 6, 7, 8} → 1st is 1 |
| 2 | 2 | 4 | 2nd is 4 |
| 3 | 3 | 6 | 3rd is 6 |
| 4 | 3 | 7 | 4th is 7 |
| 5 | 3 | 8 | 5th is 8 |
Worked-example solution.
```python
def nth_missing(arr, n):
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] - (mid + 1) < n:
            lo = mid + 1
        else:
            hi = mid
    return n + lo

print(nth_missing([2, 3, 5, 9], 1))  # 1
print(nth_missing([2, 3, 5, 9], 3))  # 6
print(nth_missing([2, 3, 5, 9], 5))  # 8
```
Rule of thumb: "n + lo" is one of those formulas that looks like magic until you've derived it once; do it on the whiteboard, then never forget it.
Common beginner mistakes
- Doing a linear scan for the first `i` where `missing(i) >= n`—correct but `O(L)`, not the asked `O(log L)`.
- Off-by-one on the indexing baseline—using `arr[i] - i` instead of `arr[i] - (i + 1)` because of 0-vs-1-based confusion.
- Off-by-one on the answer recovery—returning `lo` or `n + lo - 1` instead of `n + lo`.
- Initializing `hi = len(arr) - 1`—blocks the loop from reaching the past-end case where the answer is beyond `arr[-1]`.
- Forgetting the strict-increasing assumption—the invariant fails silently on inputs with duplicates.
Python Interview Question on the Nth Missing Integer
Given a strictly increasing array of positive integers arr (each value ≥ 1) and a positive integer n, return the Nth positive integer that is missing from arr, comparing against the contiguous baseline [1, 2, 3, ...]. Solve in O(log L) time, where L = len(arr).
Solution Using lower-bound binary search on the missing-count invariant
```python
def find_nth_missing(arr, n):
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        # missing positives below arr[mid] = arr[mid] - (mid + 1)
        if arr[mid] - (mid + 1) < n:
            lo = mid + 1
        else:
            hi = mid
    return n + lo
```
Step-by-step trace (input: arr = [2, 3, 5, 9, 12], n = 4):
| i | arr[i] | (i + 1) | missing(i) |
|---|---|---|---|
| 0 | 2 | 1 | 1 |
| 1 | 3 | 2 | 1 |
| 2 | 5 | 3 | 2 |
| 3 | 9 | 4 | 5 |
| 4 | 12 | 5 | 7 |
- Initialize — `lo = 0`, `hi = 5` (array length).
- Iteration 1 — `mid = (0 + 5) // 2 = 2`; `arr[2] = 5`; `missing(2) = 5 - 3 = 2`. `2 < 4` → predicate not yet true; move right: `lo = 3`.
- Iteration 2 — `mid = (3 + 5) // 2 = 4`; `arr[4] = 12`; `missing(4) = 12 - 5 = 7`. `7 >= 4` → predicate true; move left: `hi = 4`.
- Iteration 3 — `mid = (3 + 4) // 2 = 3`; `arr[3] = 9`; `missing(3) = 9 - 4 = 5`. `5 >= 4` → predicate true; move left: `hi = 3`.
- Loop exit — `lo == hi == 3`.
- Recover the answer — `n + lo = 4 + 3 = 7`. The 4th missing positive is `7`.
- Spot-check — missing values below 12 are `{1, 4, 6, 7, 8, 10, 11}`; the 4th is `7`. ✓
Output:
| n | nth_missing |
|---|---|
| 1 | 1 |
| 2 | 4 |
| 3 | 6 |
| 4 | 7 |
| 5 | 8 |
Why this works — concept by concept:
- Missing-count invariant — `arr[i] - (i + 1)` exactly counts missing positives strictly below `arr[i]`, derived from the gap between the actual value and its 1-based positional baseline.
- Monotone predicate — because `arr` is strictly increasing, `missing(i)` never decreases; that monotonicity is the precondition for binary search.
- Lower-bound search — `while lo < hi` with `hi = len(arr)` finds the smallest `i` where `missing(i) >= n`, the boundary the answer sits just below.
- Inclusive past-end — initializing `hi = len(arr)` (not `len(arr) - 1`) lets the search converge cleanly when the answer is beyond `arr[-1]`.
- Closed-form recovery — `n + lo` works because there are `lo` array elements at positions `1..lo`, and the Nth missing positive necessarily lives `n` slots beyond them.
- Cost — `O(log L)` time (one binary search), `O(1)` space (no auxiliary structure).
5. SQL Time-Series Aggregation for New Customers Per Day
First-touch time-series aggregation in SQL for data engineering
"Count new customers per day" is the canonical first-touch time-series prompt for ride-share analytics. The mental model: a new customer on date D is a customer whose MIN(order_date) = D. The shape is two stacked aggregates: first compute each customer's first-touch date (one row per customer), then count customers per first-touch date (one row per day). The same shape powers daily-active-new-rider reports, weekly cohort sizes, and any first-event-per-entity rollup.
Pro tip: Pin the first-touch step in a CTE—do not nest it as a correlated subquery in the daily-count `SELECT`. The CTE form is faster (one scan instead of N), more readable, and matches what an interviewer expects to see when they ask you to talk through the plan.
First-touch per user: MIN(order_date) GROUP BY user_id
The first-touch invariant: MIN(order_date) GROUP BY user_id collapses each user to one row whose date is the user's first interaction. Repeated orders by the same user contribute zero to "new customers" because we keep only the minimum.
- `MIN(order_date)` — earliest date per user.
- `GROUP BY user_id` — one row per user.
- Optional bound — `WHERE order_date >= '2026-04-01'` if the report is windowed.
- Type discipline — `MIN` over `DATE` returns `DATE`; over `TIMESTAMP` it returns `TIMESTAMP`. Cast or `DATE_TRUNC('day', …)` if the time component matters.
Worked example. Five orders across three customers.
| order_id | user_id | order_date |
|---|---|---|
| 1 | 7 | 2026-04-01 |
| 2 | 8 | 2026-04-01 |
| 3 | 7 | 2026-04-02 |
| 4 | 9 | 2026-04-02 |
| 5 | 8 | 2026-04-03 |
After MIN(order_date) GROUP BY user_id:
| user_id | first_order_date |
|---|---|
| 7 | 2026-04-01 |
| 8 | 2026-04-01 |
| 9 | 2026-04-02 |
Worked-example solution.
```sql
SELECT
  user_id,
  MIN(order_date) AS first_order_date
FROM orders
GROUP BY user_id;
```
Rule of thumb: "first time a thing happened per entity" → MIN(timestamp) GROUP BY entity_id.
Daily roll-up: COUNT(*) GROUP BY first_date
Once each customer has a first-touch date, the daily roll-up is one COUNT over the first-touch column. The result is one row per day with new_customers = COUNT(*). Dates with no new customers do not appear in this output—if you need a fully-densified daily series, left-join against a calendar table.
- `COUNT(*)` — counts users (each row in the first-touch CTE is one distinct user).
- `GROUP BY first_order_date` — bucket by day.
- `ORDER BY first_order_date` — ascending date for time-series reading.
- Sparse output — days without new customers are absent; densify with a generated date series + `LEFT JOIN` if needed (see the sketch after this list).
Worked example. Continuing the previous CTE.
| first_order_date | count step |
|---|---|
| 2026-04-01 | 2 (users 7, 8) |
| 2026-04-02 | 1 (user 9) |
After COUNT(*) GROUP BY first_order_date:
| order_day | new_customers |
|---|---|
| 2026-04-01 | 2 |
| 2026-04-02 | 1 |
Worked-example solution.
```sql
WITH first_touch AS (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
)
SELECT
  first_order_date AS order_day,
  COUNT(*) AS new_customers
FROM first_touch
GROUP BY first_order_date
ORDER BY first_order_date;
```
Rule of thumb: once each user is collapsed to a single first-touch row, COUNT(*) GROUP BY date is the simplest possible roll-up—no DISTINCT needed.
Subquery / CTE composition: feed first-touch into a daily count
The composition invariant: the inner aggregate must be one row per user; the outer aggregate then counts those users per first-touch date. You can express the inner step as a CTE, a derived table, or (rarely) a correlated subquery—stick to CTE for readability and planner friendliness.
- CTE form — `WITH first_touch AS (...) SELECT ... FROM first_touch`. Reads top-down; one scan of `orders`.
- Derived-table form — `FROM (SELECT user_id, MIN(...) FROM orders GROUP BY user_id) ft`. Equivalent plan, less readable.
- Correlated subquery (avoid) — `WHERE order_date = (SELECT MIN(order_date) FROM orders o2 WHERE o2.user_id = o.user_id)`. Slow on most engines unless the optimizer rewrites it.
- Idempotency — running the same query twice on stable input yields the same answer; no order-dependent logic.
Worked example. Same orders table; trade the CTE for an equivalent derived table.
| variant | shape | readability |
|---|---|---|
| CTE | WITH first_touch AS (...) | high |
| derived table | FROM (...) ft | medium |
| correlated subquery | WHERE order_date = (SELECT MIN ...) | low + slow |
Worked-example solution.
```sql
-- Equivalent derived-table form
SELECT
  ft.first_order_date AS order_day,
  COUNT(*) AS new_customers
FROM (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
) ft
GROUP BY ft.first_order_date
ORDER BY ft.first_order_date;
```
Rule of thumb: every "two-stacked-aggregate" question is the same pattern—derive a per-entity summary, then aggregate the summary. Pin the inner step as a CTE.
Common beginner mistakes
- Counting `COUNT(DISTINCT user_id)` over the raw `orders` table grouped by `order_date`—counts active users that day, not new customers.
- Using `MIN(order_date)` without `GROUP BY user_id`—returns one row, the global earliest date.
- Forgetting `ORDER BY first_order_date` when the consumer expects a time-sorted report.
- Densifying with a wrong calendar—leaving "0" days as missing or filling with `NULL` instead of `0`.
- Joining `orders` to itself to "find the earliest order"—when a CTE with `MIN(order_date)` is one line.
SQL Interview Question on New Customers Per Day
Table orders(order_id INT, user_id INT, order_date DATE). Return one row per day with the number of customers whose first order in the system fell on that day. Columns: order_day, new_customers. Sort ascending by order_day.
Solution Using a first-touch CTE and a daily count
```sql
WITH first_touch AS (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
)
SELECT
  first_order_date AS order_day,
  COUNT(*) AS new_customers
FROM first_touch
GROUP BY first_order_date
ORDER BY first_order_date;
```
Step-by-step trace (input: 7 rows, 4 customers across 4 days):
| order_id | user_id | order_date |
|---|---|---|
| 1 | 1 | 2026-04-01 |
| 2 | 2 | 2026-04-01 |
| 3 | 1 | 2026-04-02 |
| 4 | 3 | 2026-04-02 |
| 5 | 2 | 2026-04-03 |
| 6 | 4 | 2026-04-04 |
| 7 | 1 | 2026-04-04 |
- Build `first_touch` — group by user, take `MIN(order_date)`. User 1 → 2026-04-01; user 2 → 2026-04-01; user 3 → 2026-04-02; user 4 → 2026-04-04. Four rows total.
- Outer `GROUP BY first_order_date` — collapse the four CTE rows by their first-touch date.
- Outer `COUNT(*)` — count rows per first-touch date.
  - 2026-04-01 → users {1, 2} → 2.
  - 2026-04-02 → user {3} → 1.
  - 2026-04-04 → user {4} → 1.
  - 2026-04-03 has no first-touch date → row absent (sparse output).
- `ORDER BY first_order_date` — ascending date order.
- Final result — three rows; April 3 is correctly absent because no customer's first order fell on that day.
Output:
| order_day | new_customers |
|---|---|
| 2026-04-01 | 2 |
| 2026-04-02 | 1 |
| 2026-04-04 | 1 |
Why this works — concept by concept:
- First-touch CTE — `MIN(order_date) GROUP BY user_id` collapses each user into one row whose date is their entry into the system; this is the canonical "new" definition.
- Outer aggregate — once each user is represented exactly once with their first-touch date, `COUNT(*) GROUP BY first_order_date` correctly counts users per "new" day.
- No `DISTINCT` needed — the CTE already deduplicated users, so `COUNT(*)` is exactly `COUNT(DISTINCT user_id)`—simpler and faster to plan.
- Sparse-by-default — days without first-touches are absent from the output; this is correct unless the prompt explicitly asks for densified zeros via a calendar table.
- `ORDER BY first_order_date` — gives the consumer a time-sorted report ready to chart or page through.
- Cost — one full scan of `orders` for the CTE, one hash aggregate for the outer count → `O(N)` time and `O(U)` space where `U` is the number of distinct users.
6. SQL Window Functions for Consecutive-Day Buyer Cohorts
LAG-driven window functions for consecutive-day patterns in SQL for data engineering
"Find customers who bought on two consecutive days" is the canonical window-function prompt for retention. The mental model: for each customer, sort their distinct purchase dates; if any consecutive pair has a one-day gap, that customer qualifies. The structural primitive is LAG(purchase_date) OVER (PARTITION BY user_id ORDER BY purchase_date)—a per-user previous-purchase reference. The predicate is the date-arithmetic equality purchase_date = prev_date + INTERVAL '1 day'.
Pro tip: Pre-deduplicate within a day. If a customer made three purchases on April 2, raw `LAG` over the partitioned window will compare two same-day rows and the consecutive-day predicate will misfire. `SELECT DISTINCT user_id, purchase_date` first, then apply `LAG`.
DISTINCT (user_id, purchase_date) to collapse intra-day duplicates
The dedup invariant: window functions operate on rows; multiple same-day rows distort the LAG comparison. Either project DISTINCT (user_id, purchase_date) in a CTE or GROUP BY (user_id, purchase_date) to one row per (user, day). The downstream window then sees exactly one row per user-day.
- `SELECT DISTINCT user_id, purchase_date` — preserves both columns; no aggregates.
- `GROUP BY user_id, purchase_date` — equivalent; some engines plan slightly differently.
- Casting `TIMESTAMP` to `DATE` — `purchase_ts::date` collapses intra-day timestamps to days before the dedup.
- One row per (user, day) — the contract that the `LAG` consecutive-day check relies on.
Worked example. Three purchases for user 1: two on April 2, one on April 3. Without dedup, LAG sees two same-day rows.
| user_id | purchase_date (raw) |
|---|---|
| 1 | 2026-04-02 |
| 1 | 2026-04-02 |
| 1 | 2026-04-03 |
After DISTINCT:
| user_id | purchase_date |
|---|---|
| 1 | 2026-04-02 |
| 1 | 2026-04-03 |
Worked-example solution.
```sql
WITH user_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
)
SELECT * FROM user_days ORDER BY user_id, purchase_date;
```
Rule of thumb: whenever a date column might have multiple rows per (entity, day), dedup first and apply window functions on the deduped CTE.
LAG(purchase_date) OVER (PARTITION BY user_id ORDER BY purchase_date)
The window-function invariant: LAG(col) returns the value of col from the previous row within the partition, ordered by the same key. For the consecutive-day check, partition by user (so the lookup never crosses customers) and order by date (so "previous" means "previous purchase chronologically").
- `PARTITION BY user_id` — each user gets their own ordered window.
- `ORDER BY purchase_date` — defines what "previous" means.
- `LAG(purchase_date)` — returns NULL for the first row in each partition (no prior row).
- `LAG(col, n)` — n-row lag; default `n = 1`.
Worked example. User 1 has purchases on April 1, 2, 5; user 2 has purchases on April 1, 4. After window:
| user_id | purchase_date | prev_date |
|---|---|---|
| 1 | 2026-04-01 | NULL |
| 1 | 2026-04-02 | 2026-04-01 |
| 1 | 2026-04-05 | 2026-04-02 |
| 2 | 2026-04-01 | NULL |
| 2 | 2026-04-04 | 2026-04-01 |
Worked-example solution.
WITH user_days AS (
SELECT DISTINCT user_id, purchase_date
FROM purchases
),
with_prev AS (
SELECT
user_id,
purchase_date,
LAG(purchase_date) OVER (
PARTITION BY user_id
ORDER BY purchase_date
) AS prev_date
FROM user_days
)
SELECT * FROM with_prev ORDER BY user_id, purchase_date;
Rule of thumb: LAG and LEAD are the right primitives for "previous/next event per entity in time order"—not self-joins.
Consecutive predicate: purchase_date = prev_date + INTERVAL '1 day'
The consecutive invariant: two purchases are on consecutive days iff purchase_date - prev_date = 1 day. In SQL flavors with date arithmetic, write prev_date + INTERVAL '1 day'; in pure DATE arithmetic, prev_date + 1 works on PostgreSQL. The predicate goes into a WHERE filter on the windowed CTE.
- `WHERE purchase_date = prev_date + INTERVAL '1 day'` — strict equality on +1 day.
- `SELECT DISTINCT user_id` — collapse multiple consecutive-day pairs per user to one row in the answer.
- NULL safety — `prev_date IS NULL` rows (first purchase per user) are filtered out automatically because `NULL + 1 = NULL` and `purchase_date = NULL` is never true.
- Generalize to N consecutive days — chain N-1 `LAG` calls, or use `ROW_NUMBER` minus `purchase_date` (gaps-and-islands; see the sketch after this list).
Worked example. Continuing the windowed table above, apply the predicate.
| user_id | purchase_date | prev_date | consecutive? |
|---|---|---|---|
| 1 | 2026-04-01 | NULL | no |
| 1 | 2026-04-02 | 2026-04-01 | yes |
| 1 | 2026-04-05 | 2026-04-02 | no (gap = 3) |
| 2 | 2026-04-01 | NULL | no |
| 2 | 2026-04-04 | 2026-04-01 | no (gap = 3) |
User 1 qualifies; user 2 does not.
Worked-example solution.
```sql
WITH user_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
),
with_prev AS (
  SELECT
    user_id,
    purchase_date,
    LAG(purchase_date) OVER (
      PARTITION BY user_id
      ORDER BY purchase_date
    ) AS prev_date
  FROM user_days
)
SELECT DISTINCT user_id
FROM with_prev
WHERE purchase_date = prev_date + INTERVAL '1 day'
ORDER BY user_id;
```
Rule of thumb: always pair LAG with DISTINCT on the dedup CTE and DISTINCT user_id in the final projection—otherwise a customer with three consecutive days appears twice.
Common beginner mistakes
- Skipping the `DISTINCT (user_id, purchase_date)` step—same-day duplicates make `LAG` compare same-day rows and the predicate misfires.
- Using `INTERVAL '1 day'` in dialects that don't accept it (e.g., MySQL needs `DATE_ADD(prev_date, INTERVAL 1 DAY)`; see the sketch after this list)—engine-specific syntax matters.
- Omitting `PARTITION BY user_id`—`LAG` then crosses users and produces meaningless "previous-customer-on-different-day" comparisons.
- Forgetting `DISTINCT user_id` in the final projection—a user with multiple consecutive-day pairs is counted multiple times.
- Using `purchase_date - prev_date <= 1`—matches same-day duplicates if they survived the dedup; use strict `= 1`.
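A hedged MySQL-flavored sketch of the same query (assumption: MySQL 8.0+, which supports window functions; only the date arithmetic changes):

```sql
-- MySQL 8.0+ variant: DATE_ADD replaces the PostgreSQL INTERVAL addition.
SELECT DISTINCT user_id
FROM (
  SELECT
    user_id,
    purchase_date,
    LAG(purchase_date) OVER (
      PARTITION BY user_id
      ORDER BY purchase_date
    ) AS prev_date
  FROM (SELECT DISTINCT user_id, purchase_date FROM purchases) ud
) wp
WHERE purchase_date = DATE_ADD(prev_date, INTERVAL 1 DAY)
ORDER BY user_id;
```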
SQL Interview Question on Customers Who Bought on 2 Consecutive Days
Table purchases(user_id INT, purchase_date DATE) with possibly multiple rows per (user, date). Return the distinct user_ids of customers who made purchases on at least one pair of consecutive calendar days. Sort ascending by user_id.
Solution Using LAG with a per-user partition
```sql
WITH user_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
),
with_prev AS (
  SELECT
    user_id,
    purchase_date,
    LAG(purchase_date) OVER (
      PARTITION BY user_id
      ORDER BY purchase_date
    ) AS prev_date
  FROM user_days
)
SELECT DISTINCT user_id
FROM with_prev
WHERE purchase_date = prev_date + INTERVAL '1 day'
ORDER BY user_id;
```
Step-by-step trace (input: 9 purchases across 3 customers):
| user_id | purchase_date |
|---|---|
| 1 | 2026-04-01 |
| 1 | 2026-04-02 |
| 1 | 2026-04-02 |
| 1 | 2026-04-05 |
| 2 | 2026-04-01 |
| 2 | 2026-04-04 |
| 3 | 2026-04-10 |
| 3 | 2026-04-11 |
| 3 | 2026-04-12 |
- Build `user_days` — `DISTINCT (user_id, purchase_date)` collapses the duplicate `(1, 2026-04-02)` to one row. Result has 8 rows.
- Build `with_prev` — for each row, compute `LAG(purchase_date) OVER (PARTITION BY user_id ORDER BY purchase_date)`:
| user_id | purchase_date | prev_date |
|--------:|---------------|-----------|
| 1 | 2026-04-01 | NULL |
| 1 | 2026-04-02 | 2026-04-01 |
| 1 | 2026-04-05 | 2026-04-02 |
| 2 | 2026-04-01 | NULL |
| 2 | 2026-04-04 | 2026-04-01 |
| 3 | 2026-04-10 | NULL |
| 3 | 2026-04-11 | 2026-04-10 |
| 3 | 2026-04-12 | 2026-04-11 |
- Apply the consecutive predicate — `purchase_date = prev_date + INTERVAL '1 day'`:
  - User 1, 2026-04-02 vs. 2026-04-01 → match.
  - User 1, 2026-04-05 vs. 2026-04-02 → no (gap = 3).
  - User 2, 2026-04-04 vs. 2026-04-01 → no (gap = 3).
  - User 3, 2026-04-11 vs. 2026-04-10 → match.
  - User 3, 2026-04-12 vs. 2026-04-11 → match (second consecutive pair).
- `SELECT DISTINCT user_id` — collapse the three matches across users 1 and 3 to two rows.
- `ORDER BY user_id` — final ascending result.
Output:
| user_id |
|---|
| 1 |
| 3 |
Why this works — concept by concept:
- `DISTINCT (user_id, purchase_date)` dedup — guarantees one row per user-day, so the `LAG` comparison is between actual prior-day purchases, not intra-day duplicates that would corrupt the predicate.
- Partitioned window — `PARTITION BY user_id ORDER BY purchase_date` gives each customer an isolated, time-ordered view; `LAG` never crosses users.
- `LAG(purchase_date)` — returns the immediately previous purchase date within the partition; NULL on the first row per user, automatically handling "no prior purchase" without a special case.
- Strict-equality predicate — `purchase_date = prev_date + INTERVAL '1 day'` matches exactly the consecutive-day case; `NULL + 1 = NULL`, so first-row NULLs are silently dropped.
- `SELECT DISTINCT user_id` — collapses multiple qualifying pairs per user (e.g., a 3-day streak produces two pairs) into one answer row.
- Generalizes naturally — for "N consecutive days," chain N-1 `LAG` calls or switch to the gaps-and-islands `ROW_NUMBER() - purchase_date` trick.
- Cost — one scan of `purchases` for the dedup CTE, one window pass over the deduped rows, one final scan with the predicate filter → `O(N log N)` if the engine sorts inside the window, otherwise `O(N)` with a hash-partition operator.
Tips to Crack Lyft Data Engineering Interviews
Drill the four Lyft Python primitives, in order
The four Python primitives in the curated Lyft Python practice set are heap-backed multi-stream merge, prefix-bucket hash autocomplete, Counter-style array intersection, and binary-search on the Nth-missing-integer invariant. Each maps to a specific module: heapq for #14, collections.defaultdict for #15, collections.Counter for #22, vanilla list-and-arithmetic for #125. Learn the import lines as muscle memory—import heapq, from collections import defaultdict, Counter—and the rest of the prompt becomes pattern recognition.
Treat SQL as a two-step composition
Both Lyft SQL prompts are two stacked aggregates: §5 is "first-touch per user, then count per first-touch date"; §6 is "dedup per user-day, then LAG over partitioned window, then filter." When you see a Lyft SQL prompt, ask yourself: what's the per-entity intermediate, and what's the final aggregate over that intermediate? Pin the intermediate in a CTE, name it after what it is (first_touch, user_days, with_prev), and write the outer query as if the CTE were a real table. The Lyft SQL practice page drills exactly this pattern.
Easy-tier discipline matters more than medium-tier intuition
Three of the six Lyft problems are EASY-tier—#14, #15, #22. Easy at Lyft does not mean trivial; it means the interviewer expects zero hesitation, no off-by-ones, clean variable names, and an articulated invariant. A correct EASY answer with stuttering and one bug is graded worse than a correct MEDIUM answer with the same bug, because the EASY signal is supposed to be "this candidate writes Python fluently." The Lyft easy-tier practice set is the right warmup.
Mobility-data framing on every prompt
Frame your verbal walkthrough in ride-share vocabulary: rider events, driver streams, place autocomplete, route segments, trip-id arrays, daily new riders, return-day cohorts. Interviewers test whether you can translate the abstract algorithm to the business question—matching the interviewer's mental model of what their team builds. The Lyft medium practice set is mostly framed in mobility analytics terms; mirror that vocabulary in your answer.
State the invariant out loud, then write code
For binary-search problems (§4) and window-function problems (§6), the invariant is the answer: arr[i] - (i + 1) = missing(i) for §4, purchase_date = prev_date + INTERVAL '1 day' for §6. Write the invariant on the whiteboard before any code. Interviewers grade the explicit naming of the invariant more than the lines of code that implement it; getting to the answer fast without naming the invariant suggests memorization rather than understanding.
Where to practice on PipeCode
Start with the Lyft practice page for the curated 6-problem set. Then drill the two surface-area pages: Lyft Python practice for #14 / #15 / #22 / #125, Lyft SQL practice for #201 / #202. For broad topic coverage, browse by topic and pick the matching primitives—queue, hash table, two pointers, binary search, window functions, date arithmetic. The interview courses page bundles the SQL and Python courses if you want a structured path before the company drills.
Communication and approach under time pressure
Talk through the invariant first, the brute force second, and the optimal third—even if you go straight to the optimal in code. Interviewers grade process as much as the final answer. Leave 5 minutes at the end of each problem for an edge-case sweep: empty inputs, single-element arrays, all-duplicates, all-distinct, NULL-prev-date in window functions, exhausted streams in heap merges. The most common "almost passed" failure mode is correct happy-path code that crashes on the empty input—a 30-second sweep prevents it.
Frequently Asked Questions
What is the Lyft data engineering interview process like?
Lyft's data engineering interview typically opens with a recruiter screen, then a technical phone screen with one Python or SQL coding problem, then an onsite (or virtual onsite) of four to five rounds: two coding rounds focused on Python data structures and SQL analytics, one data-modeling or system-design discussion (rider/driver pipelines, trip aggregates, real-time event handling), and one or two behavioral rounds. The coding rounds are heavy on the patterns covered in this guide—heap-backed queues, hash-table lookups, two-pointer intersection, binary-search invariants, time-series aggregations, and window-function retention queries.
What Python topics does Lyft test for data engineers?
Lyft Python interviews concentrate on four primitives: heapq for k-way merge and priority queues (#14 Multi Stream Reader), collections.defaultdict for prefix-keyed hash buckets (#15 Text Search Auto-Complete), collections.Counter and frequency dicts for multi-set semantics (#22 Array Intersection Without Set), and lower-bound binary search over a monotone count-of-missing invariant (#125 Nth Missing Integer). All four problems are framed in mobility-data vocabulary—event streams, place autocomplete, route segments, sparse trip-id arrays. Practice the imports as muscle memory: import heapq and from collections import defaultdict, Counter.
How important is SQL for a Lyft data engineering interview?
SQL is one third of Lyft's coding signal—two of the six curated problems are SQL. Both are MEDIUM difficulty and both reward time-series thinking: #201 New Customers Per Day tests first-touch aggregation (MIN(order_date) GROUP BY user_id then COUNT GROUP BY first_date), and #202 Customers Who Bought on 2 Consecutive Days tests LAG over a user-partitioned window plus the consecutive-day predicate purchase_date = prev_date + INTERVAL '1 day'. Expect at least one SQL round where you have to write window-function or two-stage-aggregate SQL on the spot.
How hard are Lyft data engineering interview questions?
The curated Lyft set is 3 easy + 3 medium + 0 hard—a fundamentals-first profile. The hardest of the medium tier is #125 Nth Missing Integer, where the count-of-missing invariant arr[i] - (i + 1) is the crux; without it, candidates default to O(L) linear scans or array expansion. The two SQL mediums (#201 and #202) are harder than they look on the surface: the DISTINCT (user_id, purchase_date) dedup before LAG and the MIN(order_date) GROUP BY user_id first-touch CTE are both invariants that interviewers expect you to articulate, not just code.
Are window functions and consecutive-day patterns common in Lyft interviews?
Yes—#202 Customers Who Bought on 2 Consecutive Days is the explicit test of LAG, partition-by, and date-arithmetic predicates. Lyft's analytics surface (driver retention, rider return frequency, two-trip-day cohorts) leans on this pattern in production, so the interview prompt mirrors real on-the-job SQL. Drill LAG / LEAD over partitioned windows, the gaps-and-islands ROW_NUMBER() - date trick for N-consecutive-day queries, and the strict-equality date predicate. The general window-functions practice page and the date-arithmetic practice page cover both.
How many Lyft practice problems should I solve before the interview?
Solve all 6 problems on the Lyft practice page end-to-end—untimed first, then timed at 25 minutes per problem. After that, broaden to 30 to 40 additional problems spread across the matching topic pages: queue and design, hash table, two pointers, binary search, time series, window functions, date arithmetic. The Lyft Python practice page and Lyft SQL practice page are the right surfaces for the final week of prep before the loop.