Gowtham Potureddi

Lyft Data Engineering Interview Questions

Lyft data engineering interview questions lean Python-fundamentals first, with a focused SQL analytics finish. Four Python primitives anchor the set: a heap-backed multi-stream merge for ordered event readers, prefix hash buckets for text autocomplete, hash-table-plus-two-pointer array intersection without set(), and binary search on the arr[i] - (i + 1) invariant for Nth-missing-integer counting. Two SQL primitives test ride-share analytics fluency: first-touch time-series aggregation for new-customers-per-day reports and LAG-driven window functions for consecutive-day buyer cohorts. The framings are mobility data—rider event streams, place autocomplete, route-segment intersection, sparse trip-id arrays, daily new-rider counts, and two-day return retention.

This guide walks through the six topic clusters Lyft actually tests, each with a detailed topic explanation, per-subtopic walkthroughs with worked examples and solutions, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 6-problem Lyft set (3 easy, 3 medium, no hard)—a fundamentals-heavy hub that rewards Python data-structure fluency, the binary-search invariant trick, and clean ride-share SQL over algorithm-puzzle prep.



Top Lyft data engineering interview topics

From the Lyft data engineering practice set, the six numbered sections below follow this topic map (one row per H2):

| # | Topic (sections 1–6) | Why it shows up at Lyft |
|---|----------------------|-------------------------|
| 1 | Queue design and multi-stream merge in Python | Multi Stream Reader—heapq min-heap k-way merge over K sorted iterators. |
| 2 | Hash tables for text search autocomplete in Python | Text Search Auto-Complete—defaultdict(list) prefix bucket lookup for low-latency place suggestions. |
| 3 | Hash table plus two pointers for array intersection in Python | Array Intersection Without Set—Counter-style frequency dict with multi-set semantics, no set() allowed. |
| 4 | Binary search for the Nth missing integer in Python | Nth Missing Integer—arr[i] - (i + 1) invariant turns Nth-missing-positive into a clean monotone predicate. |
| 5 | SQL time-series aggregation for new customers per day | New Customers Per Day—MIN(order_date) GROUP BY user_id first-touch CTE, then COUNT per first-touch date. |
| 6 | SQL window functions for consecutive-day buyer cohorts | Customers Who Bought on 2 Consecutive Days—LAG(purchase_date) over a user-partitioned ordered window. |

Mobility-data framing rule: Lyft's prompts span ride-share analytics—rider events streamed in time order, place-name autocomplete, route-segment intersection, trip-id sparse arrays, daily new riders, two-day return retention. The interviewer is grading whether you map each business framing to the right primitive: ordered multi-stream → min-heap; prefix lookup → hash bucket; intersection without set() → frequency dict + two-pointer; Nth missing → binary search on the count-of-missing invariant; new-customers-per-day → first-touch + daily count; consecutive-day buyers → LAG partitioned by user. State the mapping out loud.


1. Queue Design and Multi-Stream Merge in Python

Heap-backed queues for multi-stream merge in Python for data engineering

"Read K sorted event streams and emit them as one merged sorted stream" is the canonical queue-design interview prompt at Lyft. The mental model: a min-heap holds the current smallest unconsumed element from each stream; pop the global minimum, write it out, then push the next element from the same stream. This is the standard k-way merge—the same pattern that powers external sort, streaming analytics over partitioned topics, and any system that joins ordered event feeds (rider GPS pings, driver location updates, trip-state transitions) on event-time.

Pro tip: Python's heapq compares tuples lexicographically. Push (value, stream_id, item) so ties on value resolve deterministically by stream_id—never push (value, item) where item is non-comparable (a dict, an object without __lt__), or your heap will raise TypeError mid-merge.
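A minimal sketch of that tagging rule, using hypothetical rider-event dicts as payloads (the field names are illustrative): tuple comparison short-circuits on the unique (timestamp, stream_id) prefix, so the non-comparable dict is never compared.

import heapq

# Dicts are not comparable, so the (timestamp, stream_id) prefix must
# fully decide the heap order before the payload would be compared.
events = [
    (1700000000, 0, {"rider": "a"}),
    (1700000000, 1, {"rider": "b"}),  # ties on timestamp, differs on stream_id
    (1699999999, 2, {"rider": "c"}),
]

heap = []
for event in events:
    heapq.heappush(heap, event)

while heap:
    ts, stream_id, payload = heapq.heappop(heap)
    print(ts, stream_id, payload["rider"])
# 1699999999 2 c
# 1700000000 0 a
# 1700000000 1 b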

heapq as a min-heap priority queue

heapq is a heap-ordered list with O(log n) push and pop. The invariant: the smallest element is always at index 0. There is no Heap object—the module provides functions that operate on a regular list, treating it as a binary heap rooted at index 0.

  • heapq.heappush(h, x) — O(log n) insert; restores heap order after appending.
  • heapq.heappop(h) — O(log n) remove the smallest; bubbles the last element down.
  • heapq.heapify(lst) — O(n) in-place heapify of an existing list.
  • Min-heap only — for a max-heap, push the negation (-x) or wrap items in a class that inverts __lt__ (see the sketch below).
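A quick sketch of the negation trick, assuming plain integers (for richer payloads, negate only the key field of the tuple):

import heapq

# A min-heap over negated values behaves as a max-heap.
maxheap = []
for v in [5, 2, 8, 1, 3]:
    heapq.heappush(maxheap, -v)

print(-maxheap[0])  # 8, the current maximum, sits (negated) at the root
print([-heapq.heappop(maxheap) for _ in range(len(maxheap))])
# [8, 5, 3, 2, 1]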

Worked example. Push the values 5, 2, 8, 1, 3 onto an empty heap one by one and trace the heap after each insertion.

| step | operation | heap state | min |
|------|-----------|------------|-----|
| 1 | push 5 | [5] | 5 |
| 2 | push 2 | [2, 5] | 2 |
| 3 | push 8 | [2, 5, 8] | 2 |
| 4 | push 1 | [1, 2, 8, 5] | 1 |
| 5 | push 3 | [1, 2, 8, 5, 3] | 1 |

Worked-example solution.

import heapq

heap = []
for x in [5, 2, 8, 1, 3]:
    heapq.heappush(heap, x)
    print(heap, "min =", heap[0])

# pop them in sorted order
sorted_out = []
while heap:
    sorted_out.append(heapq.heappop(heap))
print(sorted_out)  # [1, 2, 3, 5, 8]

Rule of thumb: if the prompt mentions "smallest of many," "k-way merge," or "top-K," reach for heapq before you reach for sorted().

k-way merge: pull cheapest head, push next from same stream

Given K sorted iterators, the merge invariant is always emit the global minimum head; never advance more than one stream per step. Seed the heap with the first element from each stream tagged with its stream_id. Pop the min (value, stream_id) and write value to the output. Then pull the next element from streams[stream_id] and push it onto the heap. Repeat until the heap is empty.

  • Tag with stream_id so on a tie the heap can break it deterministically and so the popped value tells you which stream to advance.
  • One pop, one push keeps the heap size at exactly K (one head per active stream).
  • Output order is globally sorted because at every step the heap contains every active stream's smallest unconsumed value, so heap[0] is the global minimum.
  • Total cost — n pops + n pushes over a heap of size at most K → O(n log K) for n total elements across K streams.

Worked example. Three sorted streams: s0 = [1, 4, 7], s1 = [2, 5, 8], s2 = [3, 6, 9]. Trace the heap and the output for the first four steps.

| step | heap before pop | popped | next pushed | output so far |
|------|-----------------|--------|-------------|---------------|
| 1 | [(1,s0), (2,s1), (3,s2)] | (1,s0) | (4,s0) | [1] |
| 2 | [(2,s1), (4,s0), (3,s2)] | (2,s1) | (5,s1) | [1, 2] |
| 3 | [(3,s2), (4,s0), (5,s1)] | (3,s2) | (6,s2) | [1, 2, 3] |
| 4 | [(4,s0), (5,s1), (6,s2)] | (4,s0) | (7,s0) | [1, 2, 3, 4] |

Worked-example solution.

import heapq

def k_way_merge(streams):
    heap = []
    iters = [iter(s) for s in streams]
    for i, it in enumerate(iters):
        try:
            heapq.heappush(heap, (next(it), i))
        except StopIteration:
            pass
    while heap:
        val, i = heapq.heappop(heap)
        yield val
        try:
            heapq.heappush(heap, (next(iters[i]), i))
        except StopIteration:
            pass

list(k_way_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]

Rule of thumb: never carry the entire materialized list into the heap; only one head per stream lives in the heap at a time.

Stream exhaustion: pop without re-pushing when iterator drains

A stream is exhausted when its iterator raises StopIteration on next(it). The invariant: on exhaustion, pop without pushing back. The heap then shrinks by one and the merge continues over the remaining K-1 streams.

  • Use try / except StopIteration around next(it) to detect end of stream.
  • Heap size monotonically shrinks once any stream exhausts—do not refill from a different stream.
  • Tail-merge naturally degenerates — when only one stream remains, the heap holds exactly one element each iteration; the merge keeps emitting in O(log 1) = O(1) per element.

Worked example. Two streams: s0 = [1, 5] (exhausts first), s1 = [2, 3, 4, 6]. After s0 is drained, the heap holds only s1's heads.

| step | heap | popped | next from same stream | output |
|------|------|--------|-----------------------|--------|
| 1 | [(1,s0), (2,s1)] | (1,s0) | 5 | [1] |
| 2 | [(2,s1), (5,s0)] | (2,s1) | 3 | [1, 2] |
| 3 | [(3,s1), (5,s0)] | (3,s1) | 4 | [1, 2, 3] |
| 4 | [(4,s1), (5,s0)] | (4,s1) | 6 | [1, 2, 3, 4] |
| 5 | [(5,s0), (6,s1)] | (5,s0) | StopIteration | [1, 2, 3, 4, 5] |
| 6 | [(6,s1)] | (6,s1) | StopIteration | [1, 2, 3, 4, 5, 6] |

Worked-example solution.

import heapq

def merge_two(s0, s1):
    iters = {0: iter(s0), 1: iter(s1)}
    heap = []
    for k, it in iters.items():
        try:
            heapq.heappush(heap, (next(it), k))
        except StopIteration:
            pass
    while heap:
        v, k = heapq.heappop(heap)
        yield v
        try:
            heapq.heappush(heap, (next(iters[k]), k))
        except StopIteration:
            pass  # this stream is done — do not push back

list(merge_two([1, 5], [2, 3, 4, 6]))
# [1, 2, 3, 4, 5, 6]

Rule of thumb: the merge ends exactly when the heap is empty—not when all iterators raise StopIteration once. The heap drains on its own.

Common beginner mistakes

  • Pushing the whole stream into the heap up front—O(N) memory instead of O(K).
  • Forgetting the stream_id tag and pushing only value—then on ties you can't tell which stream to advance.
  • Pushing a non-comparable payload (dict, custom object without __lt__)—heapq raises TypeError mid-merge on tied values.
  • Re-pushing from a freshly exhausted stream every iteration—silent infinite loop or StopIteration thrown to the caller.
  • Treating heap[0] as a stable pointer across pushes—it changes; use heappop to consume.

Python Interview Question on Multi-Stream Reader

Given K sorted iterables of integers, return a single iterator that yields all values in globally sorted order. Each input iterable is internally sorted, may be of different length, and may be empty. The output must be lazy (one value at a time) so the caller can stream large inputs without materializing them.

Solution Using heapq k-way merge

import heapq

def multi_stream_reader(streams):
    """Lazy k-way merge of K sorted iterables into one sorted stream."""
    iters = [iter(s) for s in streams]
    heap = []
    for i, it in enumerate(iters):
        try:
            heapq.heappush(heap, (next(it), i))
        except StopIteration:
            pass
    while heap:
        val, i = heapq.heappop(heap)
        yield val
        try:
            heapq.heappush(heap, (next(iters[i]), i))
        except StopIteration:
            pass

Step-by-step trace (input: streams = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]):

| stream_id | values |
|-----------|--------|
| 0 | 1, 4, 7 |
| 1 | 2, 5, 8 |
| 2 | 3, 6, 9 |
  1. Seed the heap — pull the head from each stream: push (1, 0), (2, 1), (3, 2). Heap now [(1,0), (2,1), (3,2)].
  2. Pop (1, 0) — yield 1. Pull next from stream 0 → 4. Push (4, 0). Heap [(2,1), (4,0), (3,2)].
  3. Pop (2, 1) — yield 2. Pull next from stream 1 → 5. Push (5, 1). Heap [(3,2), (4,0), (5,1)].
  4. Pop (3, 2) — yield 3. Pull next from stream 2 → 6. Push (6, 2). Heap [(4,0), (5,1), (6,2)].
  5. Pop (4, 0) — yield 4. Pull next from stream 0 → 7. Push (7, 0). Heap [(5,1), (7,0), (6,2)].
  6. Pop (5, 1) — yield 5. Pull next from stream 1 → 8. Push (8, 1). Heap [(6,2), (7,0), (8,1)].
  7. Pop (6, 2) — yield 6. Pull next from stream 2 → 9. Push (9, 2). Heap [(7,0), (8,1), (9,2)].
  8. Final three pops drain heap — yield 7, then 8, then 9. After popping (9, 2) and pulling next(iters[2]) raises StopIteration, the heap is empty and the generator returns.

Output:

| index | yielded |
|-------|---------|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |

Why this works — concept by concept:

  • Min-heap invariant — heap[0] is always the global minimum of the active stream heads, so the very next yield is always the correct globally smallest unconsumed value.
  • Stream-id tagging — pushing (value, stream_id) ensures ties break deterministically and tells the loop which stream's iterator to advance after each pop.
  • One-head-per-stream rule — the heap holds at most K elements; we never materialize the full input, keeping memory at O(K) regardless of total size.
  • Lazy generator — yield makes the merge producer-friendly: the caller pulls one element at a time, supporting unbounded or very large inputs (see the usage sketch after this list).
  • Exhaustion handling — try/except StopIteration around next(iters[i]) shrinks the heap on drain; no special end-of-stream sentinel is needed.
  • Cost — n pops + at most n pushes on a heap bounded by K → O(n log K) time, O(K) space, where n is total elements across all streams.
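A short usage sketch of that lazy contract, assuming the multi_stream_reader generator defined above:

from itertools import islice

merged = multi_stream_reader([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
print(list(islice(merged, 4)))  # [1, 2, 3, 4]; only four values are pulled
print(next(merged))             # 5; the generator resumes where it stopped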

Diagram showing three vertical event streams feeding a min-heap, with the smallest head being popped to a merged output and the next head from the same stream being pushed back.



2. Hash Tables for Text Search Autocomplete in Python

Prefix-bucket hash tables for text search autocomplete in Python for data engineering

"Build an autocomplete that returns matching completions for a given prefix" is the canonical hash-table interview prompt for Lyft's place-search and rider-input flows. The mental model: at index time, for every term store it under all of its prefixes; at query time, the prefix is the dictionary key and the value is the list of completions. This trades index-time memory for query-time speed: every query becomes one O(1) dict lookup followed by a bounded slice. The trade-off is exactly what real-time UIs need—keystroke-by-keystroke suggestions can't afford a linear scan of millions of place names.

Pro tip: When the corpus is huge or storing every prefix blows memory, swap to a trie instead of a prefix-keyed dict. The dict approach stores L prefix keys (O(L²) characters in total) per term of length L; a trie shares prefix paths, so key storage drops to O(unique nodes). For interview-tier prompts, the prefix-dict version is the expected baseline answer—mention the trie as a follow-up optimization.
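A minimal trie sketch for that follow-up (the class and helper names are illustrative, not a library API); nodes share prefix paths, so prefix strings are never materialized as dict keys:

class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.terms = []     # completions passing through this node

def trie_insert(root, term):
    node = root
    for ch in term:
        node = node.children.setdefault(ch, TrieNode())
        node.terms.append(term)  # one reference per (term, depth)

def trie_query(root, prefix, k=5):
    if not prefix:
        return []
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return node.terms[:k]

root = TrieNode()
for t in ["airport", "airline", "alpha"]:
    trie_insert(root, t)
print(trie_query(root, "ai"))  # ['airport', 'airline']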

defaultdict(list) as a prefix → matches map

collections.defaultdict(list) is the canonical container for "many values per key." The invariant: a key access creates an empty list if missing, so you can append without writing if key not in d: d[key] = []. The structure: prefix -> list of full terms that start with that prefix.

  • from collections import defaultdict — the import.
  • buckets = defaultdict(list) — declare with the value-factory list.
  • buckets[prefix].append(term) — append-or-create in one line.
  • No KeyError on read — buckets["unseen"] returns [] and inserts the empty list as a side effect; use .get(prefix, []) if you don't want the side effect.

Worked example. Index three terms—"airport", "airline", "alpha"—under their length-2 prefixes only.

| term | length-2 prefix | bucket after append |
|------|-----------------|---------------------|
| "airport" | "ai" | {"ai": ["airport"]} |
| "airline" | "ai" | {"ai": ["airport", "airline"]} |
| "alpha" | "al" | {"ai": ["airport", "airline"], "al": ["alpha"]} |

Worked-example solution.

from collections import defaultdict

buckets = defaultdict(list)
for term in ["airport", "airline", "alpha"]:
    buckets[term[:2]].append(term)

print(dict(buckets))
# {'ai': ['airport', 'airline'], 'al': ['alpha']}

Rule of thumb: whenever the question is "given a key, give me a list of things," defaultdict(list) is the right primitive; reach for it before dict.setdefault.

Build phase: index every prefix of every term

The autocomplete invariant: every prefix of every term must point to the term. For a term of length L, generate prefixes of length 1, 2, …, L and append the term under each. Build cost is O(sum of L²) across the corpus; query cost drops to O(1) for the lookup plus O(k) for slicing the top k results.

  • Loop for i in range(1, len(term) + 1) — generate prefixes of every length from 1 up to the full term.
  • buckets[term[:i]].append(term) — append the term under each prefix.
  • Sort each bucket by frequency or rank if the prompt asks for "best matches first."
  • Memory trade-off — each term of length L appears in L buckets, so total bucket entries = sum(L) ≈ N · L_avg across the corpus.

Worked example. Index "taxi" and "tax" under all prefixes.

| term | prefixes | bucket appends |
|------|----------|----------------|
| "tax" | "t", "ta", "tax" | {"t":["tax"], "ta":["tax"], "tax":["tax"]} |
| "taxi" | "t", "ta", "tax", "taxi" | {"t":["tax","taxi"], "ta":["tax","taxi"], "tax":["tax","taxi"], "taxi":["taxi"]} |

Worked-example solution.

from collections import defaultdict

def build_index(terms):
    buckets = defaultdict(list)
    for term in terms:
        for i in range(1, len(term) + 1):
            buckets[term[:i]].append(term)
    return buckets

idx = build_index(["tax", "taxi"])
print(idx["ta"])  # ['tax', 'taxi']
print(idx["taxi"])  # ['taxi']

Rule of thumb: if the same term appears in multiple buckets (which it does, by design), accept the redundancy—query speed pays for it.

Query phase: O(1) lookup + bounded result trim

The query invariant: a prefix lookup is one dict access plus a slice. Return the first k items of buckets[prefix]—or all of them if the bucket is short. Adding a top-k limit keeps response size predictable for UI rendering.

  • buckets.get(prefix, []) — safe lookup; returns [] for unseen prefixes.
  • results[:k] — bound the response; use heapq.nsmallest(k, ..., key=...) if you need top-k by rank rather than insertion order (see the sketch after this list).
  • Sort by frequency at index time once, not per query, if rank matters.
  • Empty-prefix guard — return [] for ""; never hand the caller "all terms."
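A sketch of the ranked variant, assuming a hypothetical index whose buckets hold (rank, term) pairs with lower rank meaning better:

import heapq

ranked_idx = {
    "ta": [(3, "taxi"), (1, "tax"), (2, "tarmac")],
}

def autocomplete_ranked(idx, prefix, k=2):
    if not prefix:
        return []
    bucket = idx.get(prefix, [])
    # nsmallest picks the k best-ranked pairs without sorting the whole bucket.
    return [term for _, term in heapq.nsmallest(k, bucket)]

print(autocomplete_ranked(ranked_idx, "ta"))  # ['tax', 'tarmac']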

Worked example. With the index from the previous example, query "ta" and "z" with k = 5.

| prefix | bucket | top-5 result |
|--------|--------|--------------|
| "ta" | ["tax", "taxi"] | ["tax", "taxi"] |
| "z" | (missing) | [] |

Worked-example solution.

def autocomplete(idx, prefix, k=5):
    if not prefix:
        return []
    return idx.get(prefix, [])[:k]

print(autocomplete(idx, "ta", k=5))  # ['tax', 'taxi']
print(autocomplete(idx, "z", k=5))   # []

Rule of thumb: a one-liner with .get(prefix, [])[:k] covers the empty-bucket and bounded-output cases together.

Common beginner mistakes

  • Using dict[key].append(...) without defaultdict — KeyError on the first append per key.
  • Indexing only the full term (buckets[term].append(term))—prefix queries return nothing.
  • Storing the same term twice in a bucket because the input list has duplicates and you didn't dedupe.
  • Slicing with result[:k] but forgetting to sort first when the prompt says "top-k by frequency."
  • Returning the full corpus on empty prefix—always guard "".

Python Interview Question on Text Search Autocomplete

Build a class Autocomplete that takes a list of unique strings at construction time and supports a method query(prefix, k=5) that returns the first k strings (insertion order is fine) that start with prefix. Empty prefix returns an empty list. Unknown prefix returns an empty list.

Solution Using defaultdict(list) prefix bucket

from collections import defaultdict

class Autocomplete:
    def __init__(self, terms):
        self.buckets = defaultdict(list)
        for term in terms:
            for i in range(1, len(term) + 1):
                self.buckets[term[:i]].append(term)

    def query(self, prefix, k=5):
        if not prefix:
            return []
        return self.buckets.get(prefix, [])[:k]

Step-by-step trace (input: terms = ["airport", "airline", "alpha"], then query("ai", k=5), query("al", k=2), query("", k=5)):

| term | prefixes generated |
|------|--------------------|
| "airport" | "a", "ai", "air", "airp", "airpo", "airpor", "airport" |
| "airline" | "a", "ai", "air", "airl", "airli", "airlin", "airline" |
| "alpha" | "a", "al", "alp", "alph", "alpha" |
  1. Build phase, term "airport" — for i = 1..7, append "airport" to buckets "a", "ai", "air", "airp", "airpo", "airpor", "airport".
  2. Build phase, term "airline" — for i = 1..7, append "airline" to buckets "a", "ai", "air", "airl", "airli", "airlin", "airline". Bucket "a" now ["airport", "airline"]; bucket "ai" now ["airport", "airline"].
  3. Build phase, term "alpha" — for i = 1..5, append "alpha" to buckets "a", "al", "alp", "alph", "alpha". Bucket "a" now ["airport", "airline", "alpha"].
  4. query("ai", k=5) — non-empty prefix → buckets.get("ai", []) returns ["airport", "airline"]; slice [:5] returns ["airport", "airline"].
  5. query("al", k=2) — non-empty prefix → buckets.get("al", []) returns ["alpha"]; slice [:2] returns ["alpha"].
  6. query("", k=5) — empty prefix guard fires; return [] without touching buckets.

Output:

| query | result |
|-------|--------|
| query("ai", 5) | ["airport", "airline"] |
| query("al", 2) | ["alpha"] |
| query("", 5) | [] |

Why this works — concept by concept:

  • defaultdict(list) — every prefix slot starts as [] automatically; we append-or-create in a single line without KeyError guards.
  • All-prefix indexing — for each term of length L, we register it under every prefix of lengths 1..L, so any partial keystroke is one dict lookup away from the matching terms.
  • Insertion-order list — Python lists preserve the order terms were appended, giving stable, deterministic results without a sort step.
  • .get(prefix, []) — safe fallback; an unknown prefix returns the empty list rather than raising or creating a phantom bucket via defaultdict autovivification.
  • Top-k slice — [:k] bounds the response cheaply; on small per-bucket lists it's effectively O(k).
  • Empty-prefix guard — short-circuits before any lookup so an empty input cannot accidentally return the entire corpus.
  • Cost — build is O(sum of L_i²) character work producing sum(L_i) ≈ N · L_avg bucket entries (each term appears in L_i buckets); query is O(1) lookup + O(k) slice.



3. Hash Table Plus Two Pointers for Array Intersection in Python

Frequency dicts and two pointers for array intersection without set() in Python

"Find the intersection of two arrays without using set()" is a deliberate Lyft constraint that forces candidates off the auto-pilot one-liner and into the underlying primitive. The mental model: intersection = elements that appear in both arrays, with multiplicity respected. Two clean approaches: a frequency dictionary built on the smaller array (then scanned against the larger), or a two-pointer walk if both arrays are already sorted. The frequency-dict version is O(n + m); the two-pointer version is O(n + m) after a sort, with O(1) extra space.

Pro tip: "Without set()" usually means the interviewer wants to see you understand multi-set semantics. If a = [1, 1, 2] and b = [1, 2, 2], the answer is [1, 2], not [1, 1, 2, 2]—each match consumes one copy from each side. Decrement on match.

Frequency dict for the smaller array

The frequency-dict invariant: count occurrences in the smaller side, then walk the larger side and decrement. The smaller-side dict bounds memory; scanning the larger array is one pass. Use collections.Counter or a plain dict with .get(x, 0) + 1.

  • Counter(a) — one-line frequency dict; values are integer counts.
  • if counts[x] > 0 before appending—prevents over-counting once the dict's tally drops to zero.
  • counts[x] -= 1 on each match—decrement so duplicates don't all match the same single occurrence.
  • Smaller-side bias — count the shorter array to keep the dict small.

Worked example. a = [1, 2, 2, 3] (smaller), b = [2, 2, 4, 1] (larger). Build counts from a, then walk b.

| step | x | counts before | match? | counts after | result |
|------|---|---------------|--------|--------------|--------|
| 0 |  | {1:1, 2:2, 3:1} |  |  | [] |
| 1 | 2 | {1:1, 2:2, 3:1} | yes (2 > 0) | {1:1, 2:1, 3:1} | [2] |
| 2 | 2 | {1:1, 2:1, 3:1} | yes (1 > 0) | {1:1, 2:0, 3:1} | [2, 2] |
| 3 | 4 | {1:1, 2:0, 3:1} | no (get(4, 0) = 0) | unchanged | [2, 2] |
| 4 | 1 | {1:1, 2:0, 3:1} | yes (1 > 0) | {1:0, 2:0, 3:1} | [2, 2, 1] |

Worked-example solution.

from collections import Counter

def intersect_freq(a, b):
    if len(a) > len(b):
        a, b = b, a  # iterate the smaller
    counts = Counter(a)
    out = []
    for x in b:
        if counts.get(x, 0) > 0:
            out.append(x)
            counts[x] -= 1
    return out

intersect_freq([1, 2, 2, 3], [2, 2, 4, 1])  # [2, 2, 1]

Rule of thumb: Counter is the right primitive for multi-set intersection—set discards multiplicity by definition.

Two-pointer alternative when both arrays are sorted

When both inputs are already sorted (or you can afford to sort them), the two-pointer walk uses O(1) extra space. Maintain pointers i and j; at each step compare a[i] and b[j]. On equal, append and advance both. Otherwise, advance the side with the smaller value.

  • Sort first if not already sorted—O((n + m) log(n + m)); otherwise O(n + m).
  • if a[i] == b[j] — match; append once, advance both.
  • elif a[i] < b[j] — advance i.
  • else — advance j.
  • Multi-set semantics are automatic: advancing both on match consumes one copy from each side.

Worked example. a = [1, 2, 2, 3], b = [1, 2, 2, 4]—both sorted. Trace (i, j) after each step.

| step | i | j | a[i] | b[j] | action | result |
|------|---|---|------|------|--------|--------|
| 0 | 0 | 0 | 1 | 1 | match → i+=1, j+=1 | [1] |
| 1 | 1 | 1 | 2 | 2 | match → i+=1, j+=1 | [1, 2] |
| 2 | 2 | 2 | 2 | 2 | match → i+=1, j+=1 | [1, 2, 2] |
| 3 | 3 | 3 | 3 | 4 | a < b → i+=1 | [1, 2, 2] |
| 4 | 4 | 3 |  |  | i out of bounds → stop | [1, 2, 2] |

Worked-example solution.

def intersect_two_pointers(a, b):
    a = sorted(a)  # noop if already sorted; cheap to be safe
    b = sorted(b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

intersect_two_pointers([1, 2, 2, 3], [1, 2, 2, 4])  # [1, 2, 2]

Rule of thumb: if both inputs are guaranteed sorted, two-pointer beats the frequency dict on memory; if one side is much bigger and unsorted, the frequency-dict version is simpler.

Multi-set semantics: decrement on match to honor counts

The multi-set invariant: each element in a can match at most one element in b, and vice versa. Without decrementing, a = [1, 1] and b = [1, 1, 1] would emit [1, 1, 1]—wrong. Decrementing the dict (or advancing both pointers) caps the result at min(count_a, count_b) per value.

  • Frequency dict — decrement counts[x] on each emit.
  • Two-pointer — advance both i and j on match.
  • set intersection — collapses both sides to distinct values; loses multiplicity.
  • Test the boundary — if both arrays have wildly different counts of the same value, multi-set vs. set behavior differs visibly.

Worked example. a = [1, 1, 2], b = [1, 2, 2].

| approach | output |
|----------|--------|
| frequency dict (decrementing) | [1, 2] |
| two-pointer | [1, 2] |
| set(a) & set(b) (for contrast) | {1, 2} (no multiplicity) |
| naïve "if x in b" without decrement | [1, 1, 2] (wrong) |

Worked-example solution.

from collections import Counter

def intersect_multiset(a, b):
    ca, cb = Counter(a), Counter(b)
    out = []
    for x, n in ca.items():
        if x in cb:
            out.extend([x] * min(n, cb[x]))
    return out

intersect_multiset([1, 1, 2], [1, 2, 2])  # [1, 2]

Rule of thumb: if the prompt says "intersection," default to multi-set semantics; ask the interviewer to confirm only if the test cases hint at set semantics.

Common beginner mistakes

  • Using set(a) & set(b) and calling it done—loses multiplicity and ignores the "no set()" constraint.
  • Forgetting to decrement on match—each a element matches every duplicate in b.
  • Building the frequency dict on the larger array—wastes the bias; build the dict on the smaller side and scan the larger array against it.
  • Returning out.append(x) for every appearance in b regardless of counts[x]—over-emits when b has more copies than a.
  • Sorting copies in place with a.sort()—mutates the caller's input; use sorted(a) instead.

Python Interview Question on Array Intersection Without Set

Given two integer arrays a and b, return their intersection as a list, without using set or frozenset. Each value should appear in the result min(count in a, count in b) times. Order does not matter, but the multiplicity must be exact.

Solution Using a frequency dict and a single scan

from collections import Counter

def array_intersection(a, b):
    if len(a) > len(b):
        a, b = b, a  # bias: build the dict on the smaller array
    counts = Counter(a)
    out = []
    for x in b:
        if counts.get(x, 0) > 0:
            out.append(x)
            counts[x] -= 1
    return out

Step-by-step trace (input: a = [1, 2, 2, 3, 4], b = [2, 2, 4, 4, 5]):

| index | b[index] | counts before | matches? | counts after | output |
|-------|----------|---------------|----------|--------------|--------|
|  |  | {1:1, 2:2, 3:1, 4:1} |  |  | [] |
| 0 | 2 | {1:1, 2:2, 3:1, 4:1} | yes | {1:1, 2:1, 3:1, 4:1} | [2] |
| 1 | 2 | {1:1, 2:1, 3:1, 4:1} | yes | {1:1, 2:0, 3:1, 4:1} | [2, 2] |
| 2 | 4 | {1:1, 2:0, 3:1, 4:1} | yes | {1:1, 2:0, 3:1, 4:0} | [2, 2, 4] |
| 3 | 4 | {1:1, 2:0, 3:1, 4:0} | no (counts[4] = 0) | unchanged | [2, 2, 4] |
| 4 | 5 | {1:1, 2:0, 3:1, 4:0} | no (get(5, 0) = 0) | unchanged | [2, 2, 4] |
  1. Bias step — len(a) = 5, len(b) = 5; arrays are equal length so no swap. Counter built from a: {1:1, 2:2, 3:1, 4:1}.
  2. b[0] = 2 — counts[2] = 2 > 0, append 2, decrement to 1.
  3. b[1] = 2 — counts[2] = 1 > 0, append 2, decrement to 0.
  4. b[2] = 4 — counts[4] = 1 > 0, append 4, decrement to 0.
  5. b[3] = 4 — counts[4] = 0, skip; the second 4 in b doesn't match because a only has one 4.
  6. b[4] = 5 — counts.get(5, 0) = 0, skip.

Output:

| index | value |
|-------|-------|
| 0 | 2 |
| 1 | 2 |
| 2 | 4 |

Why this works — concept by concept:

  • Smaller-side dict — building the Counter on the shorter input bounds memory at O(min(n, m)); we still see every element of the larger side in one linear scan.
  • Counter.get(x, 0) — handles "value not in a" without KeyError; 0 correctly signals "no remaining copies to match."
  • Decrement-on-match — capping the contribution per value at counts[x] enforces multi-set semantics: each match consumes one copy from each side.
  • Single-pass over b — exactly one walk over the larger array; no nested scans, no O(n·m) pitfalls.
  • Order preservation — output reflects encounter order in b; cheap, deterministic, and avoids a sort step.
  • No set() dependency — the constraint is honored via the dict; the algorithm doesn't rely on hash-set membership semantics.
  • Cost — O(n + m) time, O(min(n, m)) extra space.



4. Binary Search for the Nth Missing Integer in Python

Binary search on the count-of-missing invariant in Python for data engineering

"Given a strictly increasing array of positive integers, find the Nth missing positive integer relative to a contiguous [1, 2, 3, …] baseline" looks like a counting puzzle until you spot the invariant: at index i, the number of missing positives below arr[i] is exactly arr[i] - (i + 1) (assuming 0-based indexing). That count is monotonically non-decreasing as i advances—so binary search on the predicate missing(i) >= n finds the boundary in O(log L) time over the array length.

Pro tip: This problem trips candidates who reach for a linear scan first—O(L)—or expand the array—O(arr[-1]). State the invariant out loud, derive the formula, then announce the binary-search reduction. Interviewers value the explicit naming of the invariant more than they value the optimal complexity.

Missing-count invariant: missing(i) = arr[i] - (i + 1)

The invariant is the heart of the problem. If the array were exactly [1, 2, 3, …], then arr[i] = i + 1, so arr[i] - (i + 1) = 0—zero missing values. Every gap in the array bumps arr[i] past i + 1, and the gap size accumulates: arr[i] - (i + 1) is the count of positive integers from 1 to arr[i] - 1 that are not in the array.

  • Baseline — for the contiguous array [1, 2, 3, ..., k], arr[i] - (i + 1) = 0 everywhere.
  • Single gap — drop 3 to get [1, 2, 4, 5]; at i = 2, arr[2] = 4, (i + 1) = 3, missing = 1.
  • Multiple gaps accumulate — at every index, the count is the total missing strictly below arr[i].
  • Strictly non-decreasing — because the array is strictly increasing, missing(i) never goes down as i advances.

Worked example. arr = [2, 3, 5, 9]. Compute missing(i) at every index.

| i | arr[i] | (i + 1) | missing(i) | what's missing below arr[i] |
|---|--------|---------|------------|------------------------------|
| 0 | 2 | 1 | 1 | {1} |
| 1 | 3 | 2 | 1 | {1} (4 not yet seen) |
| 2 | 5 | 3 | 2 | {1, 4} |
| 3 | 9 | 4 | 5 | {1, 4, 6, 7, 8} |

Worked-example solution.

def missing_at(arr, i):
    return arr[i] - (i + 1)

arr = [2, 3, 5, 9]
for i in range(len(arr)):
    print(f"i={i}, arr[i]={arr[i]}, missing={missing_at(arr, i)}")
# i=0, arr[i]=2, missing=1
# i=1, arr[i]=3, missing=1
# i=2, arr[i]=5, missing=2
# i=3, arr[i]=9, missing=5

Rule of thumb: when "Nth missing" appears with a sorted array, write arr[i] - (i + 1) on the whiteboard before writing any code.

Binary search on the predicate missing(mid) >= n

The predicate missing(i) >= n is monotone non-decreasing: once true, it stays true. That's the binary-search precondition. Search for the smallest index where the predicate first becomes true; the Nth missing value lives in the gap immediately before that index.

  • Lower-bound search — lo = 0, hi = len(arr) (one past the last index).
  • Loop condition — while lo < hi:.
  • Mid — mid = (lo + hi) // 2.
  • Branch — if missing(mid) < n, the boundary is to the right; lo = mid + 1. Else hi = mid.
  • Termination — lo == hi is the smallest index with missing(i) >= n.

Worked example. arr = [2, 3, 5, 9], n = 3. Trace the search.

| step | lo | hi | mid | arr[mid] | missing(mid) | branch | new lo, hi |
|------|----|----|-----|----------|--------------|--------|------------|
| 0 | 0 | 4 | 2 | 5 | 2 | 2 < 3 → right | lo=3, hi=4 |
| 1 | 3 | 4 | 3 | 9 | 5 | 5 >= 3 → left | lo=3, hi=3 |
| 2 | 3 | 3 |  |  |  | exit | lo = 3 |

The first index where missing(i) >= 3 is i = 3.

Worked-example solution.

def find_boundary(arr, n):
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] - (mid + 1) < n:
            lo = mid + 1
        else:
            hi = mid
    return lo

find_boundary([2, 3, 5, 9], 3)  # 3

Rule of thumb: lower-bound binary search on a monotone predicate is the workhorse pattern—reach for it whenever "find the first index where … is true."
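For reference, the same lower bound can be phrased with the standard library, assuming Python 3.10+ for bisect's key parameter; booleans sort False < True, so bisecting for True lands on the boundary:

from bisect import bisect_left

def find_boundary_bisect(arr, n):
    # key maps index i to the predicate missing(i) >= n
    return bisect_left(range(len(arr)), True,
                       key=lambda i: arr[i] - (i + 1) >= n)

print(find_boundary_bisect([2, 3, 5, 9], 3))  # 3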

Recover the answer: n + lo after the loop

Once the loop returns the boundary index lo, the Nth missing value falls in the gap just before arr[lo] (or after the array end if lo == len(arr)). The closed-form recovery is answer = n + lo: the n accounts for "Nth missing" and lo accounts for the count of elements still in the array below the answer.

  • Why n + lo — there are lo array elements at indices 0..lo-1; their values are at least 1, 2, …, lo. The Nth missing positive must come after them, offset by the missing count.
  • Edge: lo == 0 — no array elements below the answer; the answer is n directly.
  • Edge: lo == len(arr) — the answer is past the array end; n + len(arr) is correct because every array element is "below" the answer and consumes a slot.
  • Strict-increasing assumption — the formula breaks on duplicate or non-sorted inputs.

Worked example. Continuing the search above, lo = 3, n = 3, so answer = 3 + 3 = 6. Verify: missing values are {1, 4, 6, 7, 8}, sorted; the third is 6. ✓

| n | lo | n + lo | the missing values listed |
|---|----|--------|----------------------------|
| 1 | 0 | 1 | {1, 4, 6, 7, 8} → 1st is 1 |
| 2 | 2 | 4 | 2nd is 4 |
| 3 | 3 | 6 | 3rd is 6 |
| 4 | 3 | 7 | 4th is 7 |
| 5 | 3 | 8 | 5th is 8 |

Worked-example solution.

def nth_missing(arr, n):
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] - (mid + 1) < n:
            lo = mid + 1
        else:
            hi = mid
    return n + lo

print(nth_missing([2, 3, 5, 9], 1))  # 1
print(nth_missing([2, 3, 5, 9], 3))  # 6
print(nth_missing([2, 3, 5, 9], 5))  # 8

Rule of thumb: "n + lo" is one of those formulas that looks like magic until you've derived it once; do it on the whiteboard, then never forget it.

Common beginner mistakes

  • Doing a linear scan for the first i where missing(i) >= n—correct but O(L), not the asked O(log L).
  • Off-by-one on the indexing baseline—using arr[i] - i instead of arr[i] - (i + 1) because of 0-vs-1-based confusion.
  • Off-by-one on the answer recovery—returning lo or n + lo - 1 instead of n + lo.
  • Initializing hi = len(arr) - 1—blocks the loop from reaching the past-end case where the answer is beyond arr[-1].
  • Forgetting the strict-increasing assumption—the invariant fails silently on inputs with duplicates.

Python Interview Question on the Nth Missing Integer

Given a strictly increasing array of positive integers arr (each value ≥ 1) and a positive integer n, return the Nth positive integer that is missing from arr, comparing against the contiguous baseline [1, 2, 3, ...]. Solve in O(log L) time, where L = len(arr).

Solution Using lower-bound binary search on the missing-count invariant

def find_nth_missing(arr, n):
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        # missing positives below arr[mid] = arr[mid] - (mid + 1)
        if arr[mid] - (mid + 1) < n:
            lo = mid + 1
        else:
            hi = mid
    return n + lo

Step-by-step trace (input: arr = [2, 3, 5, 9, 12], n = 4):

| i | arr[i] | (i + 1) | missing(i) |
|---|--------|---------|------------|
| 0 | 2 | 1 | 1 |
| 1 | 3 | 2 | 1 |
| 2 | 5 | 3 | 2 |
| 3 | 9 | 4 | 5 |
| 4 | 12 | 5 | 7 |
  1. Initialize — lo = 0, hi = 5 (array length).
  2. Iteration 1 — mid = (0 + 5) // 2 = 2; arr[2] = 5; missing(2) = 5 - 3 = 2. 2 < 4 → predicate not yet true; move right: lo = 3.
  3. Iteration 2 — mid = (3 + 5) // 2 = 4; arr[4] = 12; missing(4) = 12 - 5 = 7. 7 >= 4 → predicate true; move left: hi = 4.
  4. Iteration 3 — mid = (3 + 4) // 2 = 3; arr[3] = 9; missing(3) = 9 - 4 = 5. 5 >= 4 → predicate true; move left: hi = 3.
  5. Loop exit — lo == hi == 3.
  6. Recover the answer — n + lo = 4 + 3 = 7. The 4th missing positive is 7.
  7. Spot-check — missing values below 12 are {1, 4, 6, 7, 8, 10, 11}; the 4th is 7. ✓

Output:

| n | nth_missing |
|---|-------------|
| 1 | 1 |
| 2 | 4 |
| 3 | 6 |
| 4 | 7 |
| 5 | 8 |

Why this works — concept by concept:

  • Missing-count invariant — arr[i] - (i + 1) exactly counts missing positives strictly below arr[i], derived from the gap between the actual value and its 1-based positional baseline.
  • Monotone predicate — because arr is strictly increasing, missing(i) never decreases; that monotonicity is the precondition for binary search.
  • Lower-bound search — while lo < hi with hi = len(arr) finds the smallest i where missing(i) >= n, the boundary the answer sits just below.
  • Inclusive past-end — initializing hi = len(arr) (not len(arr) - 1) lets the search converge cleanly when the answer is beyond arr[-1].
  • Closed-form recovery — n + lo works because there are lo array elements at positions 1..lo, and the Nth missing positive necessarily lives n slots beyond them.
  • Cost — O(log L) time (one binary search), O(1) space (no auxiliary structure).



5. SQL Time-Series Aggregation for New Customers Per Day

First-touch time-series aggregation in SQL for data engineering

"Count new customers per day" is the canonical first-touch time-series prompt for ride-share analytics. The mental model: a new customer on date D is a customer whose MIN(order_date) = D. The shape is two stacked aggregates: first compute each customer's first-touch date (one row per customer), then count customers per first-touch date (one row per day). The same shape powers daily-active-new-rider reports, weekly cohort sizes, and any first-event-per-entity rollup.

Pro tip: Pin the first-touch step in a CTE—do not nest it as a correlated subquery in the daily-count SELECT. The CTE form is faster (one scan instead of N), more readable, and matches what an interviewer expects to see when they ask you to talk through the plan.

First-touch per user: MIN(order_date) GROUP BY user_id

The first-touch invariant: MIN(order_date) GROUP BY user_id collapses each user to one row whose date is the user's first interaction. Repeated orders by the same user contribute zero to "new customers" because we keep only the minimum.

  • MIN(order_date) — earliest date per user.
  • GROUP BY user_id — one row per user.
  • Optional bound — WHERE order_date >= '2026-04-01' if the report is windowed.
  • Type discipline — MIN over DATE returns DATE; over TIMESTAMP returns TIMESTAMP. Cast or DATE_TRUNC('day', …) if the time component matters (see the sketch below).
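A sketch of that cast, assuming PostgreSQL syntax and a hypothetical order_ts TIMESTAMP column:

-- Collapse each timestamp to its calendar day before the per-user MIN.
SELECT
  user_id,
  MIN(order_ts::date) AS first_order_date
FROM orders
GROUP BY user_id;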

Worked example. Five orders across three customers.

| order_id | user_id | order_date |
|----------|---------|------------|
| 1 | 7 | 2026-04-01 |
| 2 | 8 | 2026-04-01 |
| 3 | 7 | 2026-04-02 |
| 4 | 9 | 2026-04-02 |
| 5 | 8 | 2026-04-03 |

After MIN(order_date) GROUP BY user_id:

| user_id | first_order_date |
|---------|------------------|
| 7 | 2026-04-01 |
| 8 | 2026-04-01 |
| 9 | 2026-04-02 |

Worked-example solution.

SELECT
  user_id,
  MIN(order_date) AS first_order_date
FROM orders
GROUP BY user_id;

Rule of thumb: "first time a thing happened per entity" → MIN(timestamp) GROUP BY entity_id.

Daily roll-up: COUNT(*) GROUP BY first_date

Once each customer has a first-touch date, the daily roll-up is one COUNT over the first-touch column. The result is one row per day with new_customers = COUNT(*). Dates with no new customers do not appear in this output—if you need a fully-densified daily series, left-join against a calendar table.

  • COUNT(*) — counts users (each row in the first-touch CTE is one distinct user).
  • GROUP BY first_order_date — bucket by day.
  • ORDER BY first_order_date — ascending date for time-series reading.
  • Sparse output — days without new customers are absent; densify with a generated date series + LEFT JOIN if needed.
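Here is that densification as a sketch, assuming PostgreSQL's generate_series and an illustrative fixed reporting window:

WITH first_touch AS (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
),
calendar AS (
  SELECT d::date AS order_day
  FROM generate_series(DATE '2026-04-01', DATE '2026-04-30',
                       INTERVAL '1 day') AS d
)
SELECT
  c.order_day,
  COUNT(ft.user_id) AS new_customers  -- COUNT(col) skips NULLs, giving 0 on empty days
FROM calendar c
LEFT JOIN first_touch ft
  ON ft.first_order_date = c.order_day
GROUP BY c.order_day
ORDER BY c.order_day;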

Worked example. Continuing the previous CTE.

| first_order_date | count | users |
|------------------|-------|-------|
| 2026-04-01 | 2 | users 7, 8 |
| 2026-04-02 | 1 | user 9 |

After COUNT(*) GROUP BY first_order_date:

| order_day | new_customers |
|-----------|---------------|
| 2026-04-01 | 2 |
| 2026-04-02 | 1 |

Worked-example solution.

WITH first_touch AS (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
)
SELECT
  first_order_date AS order_day,
  COUNT(*) AS new_customers
FROM first_touch
GROUP BY first_order_date
ORDER BY first_order_date;

Rule of thumb: once each user is collapsed to a single first-touch row, COUNT(*) GROUP BY date is the simplest possible roll-up—no DISTINCT needed.

Subquery / CTE composition: feed first-touch into a daily count

The composition invariant: the inner aggregate must be one row per user; the outer aggregate then counts those users per first-touch date. You can express the inner step as a CTE, a derived table, or (rarely) a correlated subquery—stick to CTE for readability and planner friendliness.

  • CTE form — WITH first_touch AS (...) SELECT ... FROM first_touch. Read top-down; one scan of orders.
  • Derived-table form — FROM (SELECT user_id, MIN(...) FROM orders GROUP BY user_id) ft. Equivalent plan, less readable.
  • Correlated subquery (avoid) — WHERE order_date = (SELECT MIN(order_date) FROM orders o2 WHERE o2.user_id = o.user_id). Slow on most engines unless the optimizer rewrites it.
  • Idempotency — running the same query twice on stable input yields the same answer; no order-dependent logic.

Worked example. Same orders table; trade the CTE for an equivalent derived table.

| variant | shape | readability |
|---------|-------|-------------|
| CTE | WITH first_touch AS (...) | high |
| derived table | FROM (...) ft | medium |
| correlated subquery | WHERE order_date = (SELECT MIN ...) | low + slow |

Worked-example solution.

-- Equivalent derived-table form
SELECT
  ft.first_order_date AS order_day,
  COUNT(*) AS new_customers
FROM (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
) ft
GROUP BY ft.first_order_date
ORDER BY ft.first_order_date;

Rule of thumb: every "two-stacked-aggregate" question is the same pattern—derive a per-entity summary, then aggregate the summary. Pin the inner step as a CTE.

Common beginner mistakes

  • Using COUNT(DISTINCT user_id) over the raw orders table grouped by order_date—that counts customers active that day, not new customers.
  • Using MIN(order_date) without GROUP BY user_id—returns one row, the global earliest date.
  • Forgetting ORDER BY first_order_date when the consumer expects a time-sorted report.
  • Densifying with a wrong calendar—leaving "0" days as missing or filling with NULL instead of 0.
  • Joining orders to itself to "find the earliest order"—when a CTE with MIN(order_date) is one line.

SQL Interview Question on New Customers Per Day

Table orders(order_id INT, user_id INT, order_date DATE). Return one row per day with the number of customers whose first order in the system fell on that day. Columns: order_day, new_customers. Sort ascending by order_day.

Solution Using a first-touch CTE and a daily count

WITH first_touch AS (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
)
SELECT
  first_order_date AS order_day,
  COUNT(*) AS new_customers
FROM first_touch
GROUP BY first_order_date
ORDER BY first_order_date;

Step-by-step trace (input: 7 rows, 4 customers across 4 days):

| order_id | user_id | order_date |
|----------|---------|------------|
| 1 | 1 | 2026-04-01 |
| 2 | 2 | 2026-04-01 |
| 3 | 1 | 2026-04-02 |
| 4 | 3 | 2026-04-02 |
| 5 | 2 | 2026-04-03 |
| 6 | 4 | 2026-04-04 |
| 7 | 1 | 2026-04-04 |
  1. Build first_touch — group by user, take MIN(order_date). User 1 → 2026-04-01; user 2 → 2026-04-01; user 3 → 2026-04-02; user 4 → 2026-04-04. Four rows total.
  2. Outer GROUP BY first_order_date — collapse the four CTE rows by their first-touch date.
  3. Outer COUNT(*) — count rows per first-touch date.
    • 2026-04-01 → users {1, 2} → 2.
    • 2026-04-02 → user {3} → 1.
    • 2026-04-04 → user {4} → 1.
    • 2026-04-03 has no first-touch date → row absent (sparse output).
  4. ORDER BY first_order_date — ascending date order.
  5. Final result — three rows; April 3 is correctly absent because no customer's first order fell on that day.

Output:

| order_day | new_customers |
|-----------|---------------|
| 2026-04-01 | 2 |
| 2026-04-02 | 1 |
| 2026-04-04 | 1 |

Why this works — concept by concept:

  • First-touch CTE — MIN(order_date) GROUP BY user_id collapses each user into one row whose date is their entry into the system; this is the canonical "new" definition.
  • Outer aggregate — once each user is represented exactly once with their first-touch date, COUNT(*) GROUP BY first_order_date correctly counts users per "new" day.
  • No DISTINCT needed — the CTE already deduplicated users, so COUNT(*) is exactly COUNT(DISTINCT user_id)—simpler and faster to plan.
  • Sparse-by-default — days without first-touches are absent from the output; this is correct unless the prompt explicitly asks for densified zeros via a calendar table.
  • ORDER BY first_order_date — gives the consumer a time-sorted report ready to chart or page through.
  • Cost — one full scan of orders for the CTE, one hash aggregate for the outer count → O(N) time and O(U) space where U is the number of distinct users.



6. SQL Window Functions for Consecutive-Day Buyer Cohorts

LAG-driven window functions for consecutive-day patterns in SQL for data engineering

"Find customers who bought on two consecutive days" is the canonical window-function prompt for retention. The mental model: for each customer, sort their distinct purchase dates; if any consecutive pair has a one-day gap, that customer qualifies. The structural primitive is LAG(purchase_date) OVER (PARTITION BY user_id ORDER BY purchase_date)—a per-user previous-purchase reference. The predicate is the date-arithmetic equality purchase_date = prev_date + INTERVAL '1 day'.

Diagram with a horizontal date axis showing two purchase markers per user, a LAG arrow pointing from Day N to Day N+1, and a purchase_date = prev_date + 1 predicate label.

Pro tip: Pre-deduplicate within a day. If a customer made three purchases on April 2, raw LAG over the partitioned window will compare two same-day rows and the consecutive-day predicate will misfire. SELECT DISTINCT user_id, purchase_date first, then apply LAG.

DISTINCT (user_id, purchase_date) to collapse intra-day duplicates

The dedup invariant: window functions operate on rows; multiple same-day rows distort the LAG comparison. Either project DISTINCT (user_id, purchase_date) in a CTE or GROUP BY (user_id, purchase_date) to one row per (user, day). The downstream window then sees exactly one row per user-day.

  • SELECT DISTINCT user_id, purchase_date — preserves both columns; no aggregates.
  • GROUP BY user_id, purchase_date — equivalent; some engines plan slightly differently.
  • Casting TIMESTAMP to DATE — purchase_ts::date collapses intra-day timestamps to days before the dedup.
  • One row per (user, day) — the contract that LAG-with-consecutive-day relies on.

Worked example. Three purchases for user 1: two on April 2, one on April 3. Without dedup, LAG sees two same-day rows.

| user_id | purchase_date (raw) |
|---------|---------------------|
| 1 | 2026-04-02 |
| 1 | 2026-04-02 |
| 1 | 2026-04-03 |

After DISTINCT:

| user_id | purchase_date |
|---------|---------------|
| 1 | 2026-04-02 |
| 1 | 2026-04-03 |

Worked-example solution.

WITH user_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
)
SELECT * FROM user_days ORDER BY user_id, purchase_date;

Rule of thumb: whenever a date column might have multiple rows per (entity, day), dedup first and apply window functions on the deduped CTE.

LAG(purchase_date) OVER (PARTITION BY user_id ORDER BY purchase_date)

The window-function invariant: LAG(col) returns the value of col from the previous row within the partition, ordered by the same key. For the consecutive-day check, partition by user (so the lookup never crosses customers) and order by date (so "previous" means "previous purchase chronologically").

  • PARTITION BY user_id — each user gets their own ordered window.
  • ORDER BY purchase_date — defines what "previous" means.
  • LAG(purchase_date) — returns NULL for the first row in each partition (no prior row).
  • LAG(col, n) — n-row lag; default n = 1.

Worked example. User 1 has purchases on April 1, 2, 5; user 2 has purchases on April 1, 4. After window:

| user_id | purchase_date | prev_date |
|---------|---------------|-----------|
| 1 | 2026-04-01 | NULL |
| 1 | 2026-04-02 | 2026-04-01 |
| 1 | 2026-04-05 | 2026-04-02 |
| 2 | 2026-04-01 | NULL |
| 2 | 2026-04-04 | 2026-04-01 |

Worked-example solution.

WITH user_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
),
with_prev AS (
  SELECT
    user_id,
    purchase_date,
    LAG(purchase_date) OVER (
      PARTITION BY user_id
      ORDER BY purchase_date
    ) AS prev_date
  FROM user_days
)
SELECT * FROM with_prev ORDER BY user_id, purchase_date;

Rule of thumb: LAG and LEAD are the right primitives for "previous/next event per entity in time order"—not self-joins.

Consecutive predicate: purchase_date = prev_date + INTERVAL '1 day'

The consecutive invariant: two purchases are on consecutive days iff purchase_date - prev_date = 1 day. In SQL flavors with date arithmetic, write prev_date + INTERVAL '1 day'; in pure DATE arithmetic, prev_date + 1 works on PostgreSQL. The predicate goes into a WHERE filter on the windowed CTE.

  • WHERE purchase_date = prev_date + INTERVAL '1 day' — strict equality on +1 day.
  • SELECT DISTINCT user_id — collapse multiple consecutive-day pairs per user to one row in the answer.
  • NULL safety — prev_date IS NULL rows (first purchase per user) are filtered out automatically because NULL + 1 = NULL and purchase_date = NULL is never true.
  • Generalize to N consecutive days — chain N-1 LAG calls, or use ROW_NUMBER minus purchase_date (gaps-and-islands).
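A gaps-and-islands sketch of that generalization, assuming PostgreSQL date arithmetic and N = 3: dates inside one consecutive run share the same date-minus-row-number anchor, so grouping on the anchor measures run lengths.

WITH user_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
),
islands AS (
  SELECT
    user_id,
    -- consecutive dates share one anchor; any gap shifts the anchor
    purchase_date - (ROW_NUMBER() OVER (
      PARTITION BY user_id ORDER BY purchase_date
    ))::int AS island_anchor
  FROM user_days
)
SELECT DISTINCT user_id
FROM islands
GROUP BY user_id, island_anchor
HAVING COUNT(*) >= 3
ORDER BY user_id;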

Worked example. Continuing the windowed table above, apply the predicate.

| user_id | purchase_date | prev_date | consecutive? |
|---------|---------------|-----------|--------------|
| 1 | 2026-04-01 | NULL | no |
| 1 | 2026-04-02 | 2026-04-01 | yes |
| 1 | 2026-04-05 | 2026-04-02 | no (gap = 3) |
| 2 | 2026-04-01 | NULL | no |
| 2 | 2026-04-04 | 2026-04-01 | no (gap = 3) |

User 1 qualifies; user 2 does not.

Worked-example solution.

WITH user_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
),
with_prev AS (
  SELECT
    user_id,
    purchase_date,
    LAG(purchase_date) OVER (
      PARTITION BY user_id
      ORDER BY purchase_date
    ) AS prev_date
  FROM user_days
)
SELECT DISTINCT user_id
FROM with_prev
WHERE purchase_date = prev_date + INTERVAL '1 day'
ORDER BY user_id;

Rule of thumb: always pair LAG with DISTINCT on the dedup CTE and DISTINCT user_id in the final projection—otherwise a customer with three consecutive days appears twice.

Common beginner mistakes

  • Skipping the DISTINCT (user_id, purchase_date) step—same-day duplicates make LAG compare same-day rows and the predicate misfires.
  • Using INTERVAL '1 day' in dialects that don't accept it (e.g., MySQL needs DATE_ADD(prev_date, INTERVAL 1 DAY))—engine-specific syntax matters; see the dialect sketch after this list.
  • Omitting PARTITION BY user_id — LAG then crosses users and produces meaningless "previous-customer-on-different-day" comparisons.
  • Forgetting DISTINCT user_id in the final projection—a user with multiple consecutive-day pairs is counted multiple times.
  • Using purchase_date - prev_date <= 1—matches same-day duplicates if they survived the dedup; use strict = 1.
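A dialect sketch of the same query for MySQL 8+ (which has CTEs and window functions but rejects the PostgreSQL-style INTERVAL '1 day' literal):

WITH user_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
),
with_prev AS (
  SELECT
    user_id,
    purchase_date,
    LAG(purchase_date) OVER (
      PARTITION BY user_id
      ORDER BY purchase_date
    ) AS prev_date
  FROM user_days
)
SELECT DISTINCT user_id
FROM with_prev
WHERE purchase_date = DATE_ADD(prev_date, INTERVAL 1 DAY)  -- MySQL date addition
ORDER BY user_id;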

SQL Interview Question on Customers Who Bought on 2 Consecutive Days

Table purchases(user_id INT, purchase_date DATE) with possibly multiple rows per (user, date). Return the distinct user_ids of customers who made purchases on at least one pair of consecutive calendar days. Sort ascending by user_id.

Solution Using LAG with a per-user partition

WITH user_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
),
with_prev AS (
  SELECT
    user_id,
    purchase_date,
    LAG(purchase_date) OVER (
      PARTITION BY user_id
      ORDER BY purchase_date
    ) AS prev_date
  FROM user_days
)
SELECT DISTINCT user_id
FROM with_prev
WHERE purchase_date = prev_date + INTERVAL '1 day'
ORDER BY user_id;

Step-by-step trace (input: 9 purchases across 3 customers):

| user_id | purchase_date |
|---------|---------------|
| 1 | 2026-04-01 |
| 1 | 2026-04-02 |
| 1 | 2026-04-02 |
| 1 | 2026-04-05 |
| 2 | 2026-04-01 |
| 2 | 2026-04-04 |
| 3 | 2026-04-10 |
| 3 | 2026-04-11 |
| 3 | 2026-04-12 |
  1. Build user_days — DISTINCT (user_id, purchase_date) collapses the duplicate (1, 2026-04-02) to one row. Result has 8 rows.
  2. Build with_prev — for each row, compute LAG(purchase_date) OVER (PARTITION BY user_id ORDER BY purchase_date).

| user_id | purchase_date | prev_date |
|--------:|---------------|-----------|
| 1 | 2026-04-01 | NULL |
| 1 | 2026-04-02 | 2026-04-01 |
| 1 | 2026-04-05 | 2026-04-02 |
| 2 | 2026-04-01 | NULL |
| 2 | 2026-04-04 | 2026-04-01 |
| 3 | 2026-04-10 | NULL |
| 3 | 2026-04-11 | 2026-04-10 |
| 3 | 2026-04-12 | 2026-04-11 |

  3. Apply the consecutive predicate — purchase_date = prev_date + INTERVAL '1 day':
    • User 1, 2026-04-02 vs. 2026-04-01 → match.
    • User 1, 2026-04-05 vs. 2026-04-02 → no (gap = 3).
    • User 2, 2026-04-04 vs. 2026-04-01 → no (gap = 3).
    • User 3, 2026-04-11 vs. 2026-04-10 → match.
    • User 3, 2026-04-12 vs. 2026-04-11 → match (second consecutive pair).
  4. SELECT DISTINCT user_id — collapse the three matches across users 1 and 3 to two rows.
  5. ORDER BY user_id — final ascending result.

Output:

| user_id |
|---------|
| 1 |
| 3 |

Why this works — concept by concept:

  • DISTINCT (user_id, purchase_date) dedup — guarantees one row per user-day, so the LAG comparison is between actual prior-day purchases, not intra-day duplicates that would corrupt the predicate.
  • Partitioned window — PARTITION BY user_id ORDER BY purchase_date gives each customer an isolated, time-ordered view; LAG never crosses users.
  • LAG(purchase_date) — returns the immediately previous purchase date within the partition; NULL on the first row per user, automatically handling "no prior purchase" without a special case.
  • Strict-equality predicate — purchase_date = prev_date + INTERVAL '1 day' matches exactly the consecutive-day case; NULL + 1 = NULL, so first-row NULLs are silently dropped.
  • SELECT DISTINCT user_id — collapses multiple qualifying pairs per user (e.g., a 3-day streak produces two pairs) into one answer row.
  • Generalizes naturally — for "N consecutive days," chain N-1 LAG calls or switch to the gaps-and-islands ROW_NUMBER() - purchase_date trick.
  • Cost — one scan of purchases for the dedup CTE, one window pass over the deduped rows, one final scan with the predicate filter → O(N log N) if the engine sorts inside the window, otherwise O(N) with a hash-partition operator.


Tips to Crack Lyft Data Engineering Interviews

Drill the four Lyft Python primitives, in order

The four Python primitives in the curated Lyft Python practice set are heap-backed multi-stream merge, prefix-bucket hash autocomplete, Counter-style array intersection, and binary-search on the Nth-missing-integer invariant. Each maps to a specific module: heapq for #14, collections.defaultdict for #15, collections.Counter for #22, vanilla list-and-arithmetic for #125. Learn the import lines as muscle memory—import heapq, from collections import defaultdict, Counter—and the rest of the prompt becomes pattern recognition.
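As a warm-up, here is a minimal sketch exercising all four primitives in one file; the stream contents, place names, and arrays are invented for illustration:

```python
import heapq
from collections import defaultdict, Counter

# 1. Heap-backed k-way merge over sorted iterators (heapq.merge wraps
#    the push-pop min-heap loop from the multi-stream section).
merged = list(heapq.merge([1, 4, 7], [2, 5], [3, 6]))
assert merged == [1, 2, 3, 4, 5, 6, 7]

# 2. Prefix buckets: every prefix of a place name maps to its matches.
buckets = defaultdict(list)
for place in ["airport", "airfield"]:
    for i in range(1, len(place) + 1):
        buckets[place[:i]].append(place)
assert buckets["air"] == ["airport", "airfield"]

# 3. Multi-set intersection via Counter, no set() involved.
common = list((Counter([1, 2, 2, 3]) & Counter([2, 2, 4])).elements())
assert common == [2, 2]

# 4. Count-of-missing invariant: arr[i] - (i + 1) positives are missing
#    before arr[i] -- a monotone quantity ready for binary search.
arr = [2, 3, 7, 11]
missing_before = [arr[i] - (i + 1) for i in range(len(arr))]
assert missing_before == [1, 1, 4, 7]
```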

Treat SQL as a two-step composition

Both Lyft SQL prompts are two stacked aggregates: §5 is "first-touch per user, then count per first-touch date"; §6 is "dedup per user-day, then LAG over a partitioned window, then filter." When you see a Lyft SQL prompt, ask yourself: what is the per-entity intermediate, and what is the final aggregate over that intermediate? Pin the intermediate in a CTE, name it after what it is (first_touch, user_days, with_prev), and write the outer query as if the CTE were a real table. The Lyft SQL practice page drills exactly this pattern.
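A hedged skeleton of that composition for the §5 prompt, assuming the orders(user_id, order_date) table from the problem statement; the CTE and output column names are illustrative:

```sql
-- Step 1: per-entity intermediate -- each user's first order date.
WITH first_touch AS (
  SELECT user_id, MIN(order_date) AS first_date
  FROM orders
  GROUP BY user_id
)
-- Step 2: final aggregate over the intermediate -- new customers per day.
SELECT first_date, COUNT(*) AS new_customers
FROM first_touch
GROUP BY first_date
ORDER BY first_date;
```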

Easy-tier discipline matters more than medium-tier intuition

Three of the six Lyft problems are EASY-tier—#14, #15, #22. Easy at Lyft does not mean trivial; it means the interviewer expects zero hesitation, no off-by-ones, clean variable names, and an articulated invariant. An EASY answer delivered with hesitation and a lingering bug is graded worse than a MEDIUM answer with the same bug, because the EASY signal is supposed to be "this candidate writes Python fluently." The Lyft easy-tier practice set is the right warmup.

Mobility-data framing on every prompt

Frame your verbal walkthrough in ride-share vocabulary: rider events, driver streams, place autocomplete, route segments, trip-id arrays, daily new riders, return-day cohorts. Interviewers test whether you can translate the abstract algorithm to the business question—matching the interviewer's mental model of what their team builds. The Lyft medium practice set is mostly framed in mobility analytics terms; mirror that vocabulary in your answer.

State the invariant out loud, then write code

For binary-search problems (§4) and window-function problems (§6), the invariant is the answer: arr[i] - (i + 1) = missing(i) for §4, purchase_date = prev_date + INTERVAL '1 day' for §6. Write the invariant on the whiteboard before any code. Interviewers grade the explicit naming of the invariant more than the lines of code that implement it; getting to the answer fast without naming the invariant suggests memorization rather than understanding.
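A minimal sketch of the §4 invariant driving a lower-bound binary search; the function and variable names are mine and the toy array is illustrative:

```python
def nth_missing(arr, n):
    """Return the nth missing positive integer, given a sorted,
    strictly increasing array of positive integers."""
    lo, hi = 0, len(arr)
    # Invariant: arr[i] - (i + 1) = count of positives missing before arr[i].
    # Find the first index whose missing-count reaches n (lower bound).
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] - (mid + 1) < n:
            lo = mid + 1
        else:
            hi = mid
    # lo array elements precede the answer, so shift n past them.
    return n + lo

assert nth_missing([2, 3, 4, 7, 11], 5) == 9  # missing: 1, 5, 6, 8, 9
```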

Where to practice on PipeCode

Start with the Lyft practice page for the curated 6-problem set. Then drill the two surface-area pages: Lyft Python practice for #14 / #15 / #22 / #125, Lyft SQL practice for #201 / #202. For broad topic coverage, browse by topic and pick the matching primitives—queue, hash table, two pointers, binary search, window functions, date arithmetic. The interview courses page bundles the SQL and Python courses if you want a structured path before the company drills.

Communication and approach under time pressure

Talk through the invariant first, the brute force second, and the optimal third—even if you go straight to the optimal in code. Interviewers grade process as much as the final answer. Leave 5 minutes at the end of each problem for an edge-case sweep: empty inputs, single-element arrays, all-duplicates, all-distinct, NULL-prev-date in window functions, exhausted streams in heap merges. The most common "almost passed" failure mode is correct happy-path code that crashes on the empty input—a 30-second sweep prevents it.
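For the heap-merge primitive, a 30-second sweep might look like the sketch below, using the standard-library heapq.merge as a stand-in for your own k-way merge:

```python
import heapq

# Edge-case sweep for a k-way merge: no streams, all-empty streams,
# single element, all duplicates, and one exhausted stream among live ones.
assert list(heapq.merge()) == []
assert list(heapq.merge([], [])) == []
assert list(heapq.merge([1])) == [1]
assert list(heapq.merge([2, 2], [2])) == [2, 2, 2]
assert list(heapq.merge([], [1, 3], [2])) == [1, 2, 3]
```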


Frequently Asked Questions

What is the Lyft data engineering interview process like?

Lyft's data engineering interview typically opens with a recruiter screen, then a technical phone screen with one Python or SQL coding problem, then an onsite (or virtual onsite) of four to five rounds: two coding rounds focused on Python data structures and SQL analytics, one data-modeling or system-design discussion (rider/driver pipelines, trip aggregates, real-time event handling), and one or two behavioral rounds. The coding rounds are heavy on the patterns covered in this guide—heap-backed queues, hash-table lookups, two-pointer intersection, binary-search invariants, time-series aggregations, and window-function retention queries.

What Python topics does Lyft test for data engineers?

Lyft Python interviews concentrate on four primitives: heapq for k-way merge and priority queues (#14 Multi Stream Reader), collections.defaultdict for prefix-keyed hash buckets (#15 Text Search Auto-Complete), collections.Counter and frequency dicts for multi-set semantics (#22 Array Intersection Without Set), and lower-bound binary search over a monotone count-of-missing invariant (#125 Nth Missing Integer). All four problems are framed in mobility-data vocabulary—event streams, place autocomplete, route segments, sparse trip-id arrays. Practice the imports as muscle memory: import heapq and from collections import defaultdict, Counter.

How important is SQL for a Lyft data engineering interview?

SQL is one third of Lyft's coding signal—two of the six curated problems are SQL. Both are MEDIUM difficulty and both reward time-series thinking: #201 New Customers Per Day tests first-touch aggregation (MIN(order_date) GROUP BY user_id then COUNT GROUP BY first_date), and #202 Customers Who Bought on 2 Consecutive Days tests LAG over a user-partitioned window plus the consecutive-day predicate purchase_date = prev_date + INTERVAL '1 day'. Expect at least one SQL round where you have to write window-function or two-stage-aggregate SQL on the spot.

How hard are Lyft data engineering interview questions?

The curated Lyft set is 3 easy + 3 medium + 0 hard—a fundamentals-first profile. The hardest of the medium tier is #125 Nth Missing Integer, where the count-of-missing invariant arr[i] - (i + 1) is the crux; without it, candidates default to linear scans or materializing the expanded array instead of an O(log n) binary search. The two SQL mediums (#201 and #202) are harder than they look on the surface: the DISTINCT (user_id, purchase_date) dedup before LAG and the MIN(order_date) GROUP BY user_id first-touch CTE are both invariants that interviewers expect you to articulate, not just code.

Are window functions and consecutive-day patterns common in Lyft interviews?

Yes—#202 Customers Who Bought on 2 Consecutive Days is the explicit test of LAG, partition-by, and date-arithmetic predicates. Lyft's analytics surface (driver retention, rider return frequency, two-trip-day cohorts) leans on this pattern in production, so the interview prompt mirrors real on-the-job SQL. Drill LAG / LEAD over partitioned windows, the gaps-and-islands ROW_NUMBER() - date trick for N-consecutive-day queries, and the strict-equality date predicate. The general window-functions practice page and the date-arithmetic practice page cover both.

How many Lyft practice problems should I solve before the interview?

Solve all 6 problems on the Lyft practice page end-to-end—untimed first, then timed at 25 minutes per problem. After that, broaden to 30 to 40 additional problems spread across the matching topic pages: queue and design, hash table, two pointers, binary search, time series, window functions, date arithmetic. The Lyft Python practice page and Lyft SQL practice page are the right surfaces for the final week of prep before the loop.


Start practicing Lyft data engineering problems
