Gowtham Potureddi

Lyft Data Engineering Interview Questions

Lyft data engineering interview questions lean Python-fundamentals first, with a focused SQL analytics finish. Four Python primitives anchor the set: a heap-backed multi-stream merge for ordered event readers, prefix hash buckets for text autocomplete, hash-table-plus-two-pointer array intersection without set(), and binary search on the arr[i] - (i + 1) invariant for Nth-missing-integer counting. Two SQL primitives test ride-share analytics fluency: first-touch time-series aggregation for new-customers-per-day reports and LAG-driven window functions for consecutive-day buyer cohorts. The framings are mobility data—rider event streams, place autocomplete, route-segment intersection, sparse trip-id arrays, daily new-rider counts, and two-day return retention.

This guide walks through the six topic clusters Lyft actually tests, each with a detailed topic explanation, per-subtopic walkthroughs with worked examples and solutions, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 6-problem Lyft set (3 easy, 3 medium, no hard)—a fundamentals-heavy hub that rewards Python data-structure fluency, the binary-search invariant trick, and clean ride-share SQL over algorithm-puzzle prep.



Top Lyft data engineering interview topics

From the Lyft data engineering practice set, the six numbered sections below follow this topic map (one row per H2):

| # | Topic (sections 1–6) | Why it shows up at Lyft |
|---|----------------------|-------------------------|
| 1 | Queue design and multi-stream merge in Python | Multi Stream Reader—heapq min-heap k-way merge over K sorted iterators. |
| 2 | Hash tables for text search autocomplete in Python | Text Search Auto-Complete—defaultdict(list) prefix bucket lookup for low-latency place suggestions. |
| 3 | Hash table plus two pointers for array intersection in Python | Array Intersection Without Set—Counter-style frequency dict with multi-set semantics, no set() allowed. |
| 4 | Binary search for the Nth missing integer in Python | Nth Missing Integer—arr[i] - (i + 1) invariant turns Nth-missing-positive into a clean monotone predicate. |
| 5 | SQL time-series aggregation for new customers per day | New Customers Per Day—MIN(order_date) GROUP BY user_id first-touch CTE, then COUNT per first-touch date. |
| 6 | SQL window functions for consecutive-day buyer cohorts | Customers Who Bought on 2 Consecutive Days—LAG(purchase_date) over a user-partitioned ordered window. |

Mobility-data framing rule: Lyft's prompts span ride-share analytics—rider events streamed in time order, place-name autocomplete, route-segment intersection, trip-id sparse arrays, daily new riders, two-day return retention. The interviewer is grading whether you map each business framing to the right primitive: ordered multi-stream → min-heap; prefix lookup → hash bucket; intersection without set() → frequency dict + two-pointer; Nth missing → binary search on the count-of-missing invariant; new-customers-per-day → first-touch + daily count; consecutive-day buyers → LAG partitioned by user. State the mapping out loud.


1. Queue Design and Multi-Stream Merge in Python

Heap-backed queues for multi-stream merge in Python for data engineering

"Read K sorted event streams and emit them as one merged sorted stream" is the canonical queue-design interview prompt at Lyft. The mental model: a min-heap holds the current smallest unconsumed element from each stream; pop the global minimum, write it out, then push the next element from the same stream. This is the standard k-way merge—the same pattern that powers external sort, streaming analytics over partitioned topics, and any system that joins ordered event feeds (rider GPS pings, driver location updates, trip-state transitions) on event-time.

Pro tip: Python's heapq compares tuples lexicographically. Push (value, stream_id, item) so ties on value resolve deterministically by stream_id—never push (value, item) where item is non-comparable (a dict, an object without __lt__), or your heap will raise TypeError mid-merge.
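A minimal sketch of that tagging rule, using hypothetical rider-event dicts as payloads (the field names are illustrative): tuple comparison short-circuits on the unique (timestamp, stream_id) prefix, so the non-comparable dict is never compared.

import heapq

# Dicts are not comparable, so the (timestamp, stream_id) prefix must
# fully decide the heap order before the payload would be compared.
events = [
    (1700000000, 0, {"rider": "a"}),
    (1700000000, 1, {"rider": "b"}),  # ties on timestamp, differs on stream_id
    (1699999999, 2, {"rider": "c"}),
]

heap = []
for event in events:
    heapq.heappush(heap, event)

while heap:
    ts, stream_id, payload = heapq.heappop(heap)
    print(ts, stream_id, payload["rider"])
# 1699999999 2 c
# 1700000000 0 a
# 1700000000 1 b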

heapq as a min-heap priority queue

heapq is a heap-ordered list with O(log n) push and pop. The invariant: the smallest element is always at index 0. There is no Heap object—the module provides functions that operate on a regular list, treating it as a binary heap rooted at index 0.

  • heapq.heappush(h, x) — O(log n) insert; restores heap order after appending.
  • heapq.heappop(h) — O(log n) remove the smallest; bubbles the last element down.
  • heapq.heapify(lst) — O(n) in-place heapify of an existing list.
  • Min-heap only — for a max-heap, push the negation (-x) or wrap items in a class that inverts __lt__ (see the sketch below).
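A quick sketch of the negation trick, assuming plain integers (for richer payloads, negate only the key field of the tuple):

import heapq

# A min-heap over negated values behaves as a max-heap.
maxheap = []
for v in [5, 2, 8, 1, 3]:
    heapq.heappush(maxheap, -v)

print(-maxheap[0])  # 8, the current maximum, sits (negated) at the root
print([-heapq.heappop(maxheap) for _ in range(len(maxheap))])
# [8, 5, 3, 2, 1]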

Worked example. Push the values 5, 2, 8, 1, 3 onto an empty heap one by one and trace the heap after each insertion.

| step | operation | heap state | min |
|------|-----------|------------|-----|
| 1 | push 5 | [5] | 5 |
| 2 | push 2 | [2, 5] | 2 |
| 3 | push 8 | [2, 5, 8] | 2 |
| 4 | push 1 | [1, 2, 8, 5] | 1 |
| 5 | push 3 | [1, 2, 8, 5, 3] | 1 |

Worked-example solution.

import heapq

heap = []
for x in [5, 2, 8, 1, 3]:
    heapq.heappush(heap, x)
    print(heap, "min =", heap[0])

# pop them in sorted order
sorted_out = []
while heap:
    sorted_out.append(heapq.heappop(heap))
print(sorted_out)  # [1, 2, 3, 5, 8]

Rule of thumb: if the prompt mentions "smallest of many," "k-way merge," or "top-K," reach for heapq before you reach for sorted().

k-way merge: pull cheapest head, push next from same stream

Given K sorted iterators, the merge invariant is always emit the global minimum head; never advance more than one stream per step. Seed the heap with the first element from each stream tagged with its stream_id. Pop the min (value, stream_id) and write value to the output. Then pull the next element from streams[stream_id] and push it onto the heap. Repeat until the heap is empty.

  • Tag with stream_id so on a tie the heap can break it deterministically and so the popped value tells you which stream to advance.
  • One pop, one push keeps the heap size at exactly K (one head per active stream).
  • Output order is globally sorted because at every step the heap contains every active stream's smallest unconsumed value, so heap[0] is the global minimum.
  • Total cost — n pops + n pushes over a heap of size at most K → O(n log K) for n total elements across K streams.

Worked example. Three sorted streams: s0 = [1, 4, 7], s1 = [2, 5, 8], s2 = [3, 6, 9]. Trace the heap and the output for the first four steps.

| step | heap before pop | popped | next pushed | output so far |
|------|-----------------|--------|-------------|---------------|
| 1 | [(1,s0), (2,s1), (3,s2)] | (1,s0) | (4,s0) | [1] |
| 2 | [(2,s1), (4,s0), (3,s2)] | (2,s1) | (5,s1) | [1, 2] |
| 3 | [(3,s2), (4,s0), (5,s1)] | (3,s2) | (6,s2) | [1, 2, 3] |
| 4 | [(4,s0), (5,s1), (6,s2)] | (4,s0) | (7,s0) | [1, 2, 3, 4] |

Worked-example solution.

import heapq

def k_way_merge(streams):
    heap = []
    iters = [iter(s) for s in streams]
    for i, it in enumerate(iters):
        try:
            heapq.heappush(heap, (next(it), i))
        except StopIteration:
            pass
    while heap:
        val, i = heapq.heappop(heap)
        yield val
        try:
            heapq.heappush(heap, (next(iters[i]), i))
        except StopIteration:
            pass

list(k_way_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]

Rule of thumb: never carry the entire materialized list into the heap; only one head per stream lives in the heap at a time.

Stream exhaustion: pop without re-pushing when iterator drains

A stream is exhausted when its iterator raises StopIteration on next(it). The invariant: on exhaustion, pop without pushing back. The heap then shrinks by one and the merge continues over the remaining K-1 streams.

  • Use try / except StopIteration around next(it) to detect end of stream.
  • Heap size monotonically shrinks once any stream exhausts—do not refill from a different stream.
  • Tail-merge naturally degenerates — when only one stream remains, the heap holds exactly one element each iteration; the merge keeps emitting in O(log 1) = O(1) per element.

Worked example. Two streams: s0 = [1, 5] (exhausts first), s1 = [2, 3, 4, 6]. After s0 is drained, the heap holds only s1's heads.

| step | heap | popped | next from same stream | output |
|------|------|--------|-----------------------|--------|
| 1 | [(1,s0), (2,s1)] | (1,s0) | 5 | [1] |
| 2 | [(2,s1), (5,s0)] | (2,s1) | 3 | [1, 2] |
| 3 | [(3,s1), (5,s0)] | (3,s1) | 4 | [1, 2, 3] |
| 4 | [(4,s1), (5,s0)] | (4,s1) | 6 | [1, 2, 3, 4] |
| 5 | [(5,s0), (6,s1)] | (5,s0) | StopIteration | [1, 2, 3, 4, 5] |
| 6 | [(6,s1)] | (6,s1) | StopIteration | [1, 2, 3, 4, 5, 6] |

Worked-example solution.

import heapq

def merge_two(s0, s1):
    iters = {0: iter(s0), 1: iter(s1)}
    heap = []
    for k, it in iters.items():
        try:
            heapq.heappush(heap, (next(it), k))
        except StopIteration:
            pass
    while heap:
        v, k = heapq.heappop(heap)
        yield v
        try:
            heapq.heappush(heap, (next(iters[k]), k))
        except StopIteration:
            pass  # this stream is done — do not push back

list(merge_two([1, 5], [2, 3, 4, 6]))
# [1, 2, 3, 4, 5, 6]

Rule of thumb: the merge ends exactly when the heap is empty—not when all iterators raise StopIteration once. The heap drains on its own.

Common beginner mistakes

  • Pushing the whole stream into the heap up front—O(N) memory instead of O(K).
  • Forgetting the stream_id tag and pushing only value—then on ties you can't tell which stream to advance.
  • Pushing a non-comparable payload (dict, custom object without __lt__)—heapq raises TypeError mid-merge on tied values.
  • Re-pushing from a freshly exhausted stream every iteration—silent infinite loop or StopIteration thrown to the caller.
  • Treating heap[0] as a stable pointer across pushes—it changes; use heappop to consume.

Python Interview Question on Multi-Stream Reader

Given K sorted iterables of integers, return a single iterator that yields all values in globally sorted order. Each input iterable is internally sorted, may be of different length, and may be empty. The output must be lazy (one value at a time) so the caller can stream large inputs without materializing them.

Solution Using heapq k-way merge

import heapq

def multi_stream_reader(streams):
    """Lazy k-way merge of K sorted iterables into one sorted stream."""
    iters = [iter(s) for s in streams]
    heap = []
    for i, it in enumerate(iters):
        try:
            heapq.heappush(heap, (next(it), i))
        except StopIteration:
            pass
    while heap:
        val, i = heapq.heappop(heap)
        yield val
        try:
            heapq.heappush(heap, (next(iters[i]), i))
        except StopIteration:
            pass

Step-by-step trace (input: streams = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]):

| stream_id | values |
|-----------|--------|
| 0 | 1, 4, 7 |
| 1 | 2, 5, 8 |
| 2 | 3, 6, 9 |
  1. Seed the heap — pull the head from each stream: push (1, 0), (2, 1), (3, 2). Heap now [(1,0), (2,1), (3,2)].
  2. Pop (1, 0) — yield 1. Pull next from stream 0 → 4. Push (4, 0). Heap [(2,1), (4,0), (3,2)].
  3. Pop (2, 1) — yield 2. Pull next from stream 1 → 5. Push (5, 1). Heap [(3,2), (4,0), (5,1)].
  4. Pop (3, 2) — yield 3. Pull next from stream 2 → 6. Push (6, 2). Heap [(4,0), (5,1), (6,2)].
  5. Pop (4, 0) — yield 4. Pull next from stream 0 → 7. Push (7, 0). Heap [(5,1), (7,0), (6,2)].
  6. Pop (5, 1) — yield 5. Pull next from stream 1 → 8. Push (8, 1). Heap [(6,2), (7,0), (8,1)].
  7. Pop (6, 2) — yield 6. Pull next from stream 2 → 9. Push (9, 2). Heap [(7,0), (8,1), (9,2)].
  8. Final three pops drain heap — yield 7, then 8, then 9. After popping (9, 2) and pulling next(iters[2]) raises StopIteration, the heap is empty and the generator returns.

Output:

| index | yielded |
|-------|---------|
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
| 4 | 5 |
| 5 | 6 |
| 6 | 7 |
| 7 | 8 |
| 8 | 9 |

Why this works — concept by concept:

  • Min-heap invariant — heap[0] is always the global minimum of the active stream heads, so the very next yield is always the correct globally smallest unconsumed value.
  • Stream-id tagging — pushing (value, stream_id) ensures ties break deterministically and tells the loop which stream's iterator to advance after each pop.
  • One-head-per-stream rule — the heap holds at most K elements; we never materialize the full input, keeping memory at O(K) regardless of total size.
  • Lazy generator — yield makes the merge producer-friendly: the caller pulls one element at a time, supporting unbounded or very large inputs (see the usage sketch after this list).
  • Exhaustion handling — try/except StopIteration around next(iters[i]) shrinks the heap on drain; no special end-of-stream sentinel is needed.
  • Cost — n pops + at most n pushes on a heap bounded by K → O(n log K) time, O(K) space, where n is total elements across all streams.
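A short usage sketch of that lazy contract, assuming the multi_stream_reader generator defined above:

from itertools import islice

merged = multi_stream_reader([[1, 4, 7], [2, 5, 8], [3, 6, 9]])
print(list(islice(merged, 4)))  # [1, 2, 3, 4]; only four values are pulled
print(next(merged))             # 5; the generator resumes where it stopped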

Diagram showing three vertical event streams feeding a min-heap, with the smallest head being popped to a merged output and the next head from the same stream being pushed back.



2. Hash Tables for Text Search Autocomplete in Python

Prefix-bucket hash tables for text search autocomplete in Python for data engineering

"Build an autocomplete that returns matching completions for a given prefix" is the canonical hash-table interview prompt for Lyft's place-search and rider-input flows. The mental model: at index time, for every term store it under all of its prefixes; at query time, the prefix is the dictionary key and the value is the list of completions. This trades index-time memory for query-time speed: every query becomes one O(1) dict lookup followed by a bounded slice. The trade-off is exactly what real-time UIs need—keystroke-by-keystroke suggestions can't afford a linear scan of millions of place names.

Pro tip: When the corpus is huge or storing every prefix blows memory, swap to a trie instead of a prefix-keyed dict. The dict approach stores L prefix keys (O(L²) characters in total) per term of length L; a trie shares prefix paths, so key storage drops to O(unique nodes). For interview-tier prompts, the prefix-dict version is the expected baseline answer—mention the trie as a follow-up optimization.
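A minimal trie sketch for that follow-up (the class and helper names are illustrative, not a library API); nodes share prefix paths, so prefix strings are never materialized as dict keys:

class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.terms = []     # completions passing through this node

def trie_insert(root, term):
    node = root
    for ch in term:
        node = node.children.setdefault(ch, TrieNode())
        node.terms.append(term)  # one reference per (term, depth)

def trie_query(root, prefix, k=5):
    if not prefix:
        return []
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    return node.terms[:k]

root = TrieNode()
for t in ["airport", "airline", "alpha"]:
    trie_insert(root, t)
print(trie_query(root, "ai"))  # ['airport', 'airline']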

defaultdict(list) as a prefix → matches map

collections.defaultdict(list) is the canonical container for "many values per key." The invariant: a key access creates an empty list if missing, so you can append without writing if key not in d: d[key] = []. The structure: prefix -> list of full terms that start with that prefix.

  • from collections import defaultdict — the import.
  • buckets = defaultdict(list) — declare with the value-factory list.
  • buckets[prefix].append(term) — append-or-create in one line.
  • No KeyError on read — buckets["unseen"] returns [] and inserts the empty list as a side effect; use .get(prefix, []) if you don't want the side effect.

Worked example. Index three terms—"airport", "airline", "alpha"—under their length-2 prefixes only.

| term | length-2 prefix | bucket after append |
|------|-----------------|---------------------|
| "airport" | "ai" | {"ai": ["airport"]} |
| "airline" | "ai" | {"ai": ["airport", "airline"]} |
| "alpha" | "al" | {"ai": ["airport", "airline"], "al": ["alpha"]} |

Worked-example solution.

from collections import defaultdict

buckets = defaultdict(list)
for term in ["airport", "airline", "alpha"]:
    buckets[term[:2]].append(term)

print(dict(buckets))
# {'ai': ['airport', 'airline'], 'al': ['alpha']}

Rule of thumb: whenever the question is "given a key, give me a list of things," defaultdict(list) is the right primitive; reach for it before dict.setdefault.

Build phase: index every prefix of every term

The autocomplete invariant: every prefix of every term must point to the term. For a term of length L, generate prefixes of length 1, 2, …, L and append the term under each. Build cost is O(sum of L²) across the corpus; query cost drops to O(1) for the lookup plus O(k) for slicing the top k results.

  • Loop for i in range(1, len(term) + 1) — generate prefixes of every length from 1 up to the full term.
  • buckets[term[:i]].append(term) — append the term under each prefix.
  • Sort each bucket by frequency or rank if the prompt asks for "best matches first."
  • Memory trade-off — each term of length L appears in L buckets, so total bucket entries = sum(L) ≈ N · L_avg across the corpus.

Worked example. Index "taxi" and "tax" under all prefixes.

| term | prefixes | bucket appends |
|------|----------|----------------|
| "tax" | "t", "ta", "tax" | {"t":["tax"], "ta":["tax"], "tax":["tax"]} |
| "taxi" | "t", "ta", "tax", "taxi" | {"t":["tax","taxi"], "ta":["tax","taxi"], "tax":["tax","taxi"], "taxi":["taxi"]} |

Worked-example solution.

from collections import defaultdict

def build_index(terms):
    buckets = defaultdict(list)
    for term in terms:
        for i in range(1, len(term) + 1):
            buckets[term[:i]].append(term)
    return buckets

idx = build_index(["tax", "taxi"])
print(idx["ta"])  # ['tax', 'taxi']
print(idx["taxi"])  # ['taxi']

Rule of thumb: if the same term appears in multiple buckets (which it does, by design), accept the redundancy—query speed pays for it.

Query phase: O(1) lookup + bounded result trim

The query invariant: a prefix lookup is one dict access plus a slice. Return the first k items of buckets[prefix]—or all of them if the bucket is short. Adding a top-k limit keeps response size predictable for UI rendering.

  • buckets.get(prefix, []) — safe lookup; returns [] for unseen prefixes.
  • results[:k] — bound the response; use heapq.nsmallest(k, ..., key=...) if you need top-k by rank rather than insertion order (see the sketch after this list).
  • Sort by frequency at index time once, not per query, if rank matters.
  • Empty-prefix guard — return [] for ""; never hand the caller "all terms."
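A sketch of the ranked variant, assuming a hypothetical index whose buckets hold (rank, term) pairs with lower rank meaning better:

import heapq

ranked_idx = {
    "ta": [(3, "taxi"), (1, "tax"), (2, "tarmac")],
}

def autocomplete_ranked(idx, prefix, k=2):
    if not prefix:
        return []
    bucket = idx.get(prefix, [])
    # nsmallest picks the k best-ranked pairs without sorting the whole bucket.
    return [term for _, term in heapq.nsmallest(k, bucket)]

print(autocomplete_ranked(ranked_idx, "ta"))  # ['tax', 'tarmac']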

Worked example. With the index from the previous example, query "ta" and "z" with k = 5.

| prefix | bucket | top-5 result |
|--------|--------|--------------|
| "ta" | ["tax", "taxi"] | ["tax", "taxi"] |
| "z" | (missing) | [] |

Worked-example solution.

def autocomplete(idx, prefix, k=5):
    if not prefix:
        return []
    return idx.get(prefix, [])[:k]

print(autocomplete(idx, "ta", k=5))  # ['tax', 'taxi']
print(autocomplete(idx, "z", k=5))   # []

Rule of thumb: a one-liner with .get(prefix, [])[:k] covers the empty-bucket and bounded-output cases together.

Common beginner mistakes

  • Using dict[key].append(...) without defaultdict — KeyError on the first append per key.
  • Indexing only the full term (buckets[term].append(term))—prefix queries return nothing.
  • Storing the same term twice in a bucket because the input list has duplicates and you didn't dedupe.
  • Slicing with result[:k] but forgetting to sort first when the prompt says "top-k by frequency."
  • Returning the full corpus on empty prefix—always guard "".

Python Interview Question on Text Search Autocomplete

Build a class Autocomplete that takes a list of unique strings at construction time and supports a method query(prefix, k=5) that returns the first k strings (insertion order is fine) that start with prefix. Empty prefix returns an empty list. Unknown prefix returns an empty list.

Solution Using defaultdict(list) prefix bucket

from collections import defaultdict

class Autocomplete:
    def __init__(self, terms):
        self.buckets = defaultdict(list)
        for term in terms:
            for i in range(1, len(term) + 1):
                self.buckets[term[:i]].append(term)

    def query(self, prefix, k=5):
        if not prefix:
            return []
        return self.buckets.get(prefix, [])[:k]

Step-by-step trace (input: terms = ["airport", "airline", "alpha"], then query("ai", k=5), query("al", k=2), query("", k=5)):

| term | prefixes generated |
|------|--------------------|
| "airport" | "a", "ai", "air", "airp", "airpo", "airpor", "airport" |
| "airline" | "a", "ai", "air", "airl", "airli", "airlin", "airline" |
| "alpha" | "a", "al", "alp", "alph", "alpha" |
  1. Build phase, term "airport" — for i = 1..7, append "airport" to buckets "a", "ai", "air", "airp", "airpo", "airpor", "airport".
  2. Build phase, term "airline" — for i = 1..7, append "airline" to buckets "a", "ai", "air", "airl", "airli", "airlin", "airline". Bucket "a" now ["airport", "airline"]; bucket "ai" now ["airport", "airline"].
  3. Build phase, term "alpha" — for i = 1..5, append "alpha" to buckets "a", "al", "alp", "alph", "alpha". Bucket "a" now ["airport", "airline", "alpha"].
  4. query("ai", k=5) — non-empty prefix → buckets.get("ai", []) returns ["airport", "airline"]; slice [:5] returns ["airport", "airline"].
  5. query("al", k=2) — non-empty prefix → buckets.get("al", []) returns ["alpha"]; slice [:2] returns ["alpha"].
  6. query("", k=5) — empty prefix guard fires; return [] without touching buckets.

Output:

| query | result |
|-------|--------|
| query("ai", 5) | ["airport", "airline"] |
| query("al", 2) | ["alpha"] |
| query("", 5) | [] |

Why this works — concept by concept:

  • defaultdict(list) — every prefix slot starts as [] automatically; we append-or-create in a single line without KeyError guards.
  • All-prefix indexing — for each term of length L, we register it under every prefix of lengths 1..L, so any partial keystroke is one dict lookup away from the matching terms.
  • Insertion-order list — Python lists preserve the order terms were appended, giving stable, deterministic results without a sort step.
  • .get(prefix, []) — safe fallback; an unknown prefix returns the empty list rather than raising or creating a phantom bucket via defaultdict autovivification.
  • Top-k slice — [:k] bounds the response cheaply; on small per-bucket lists it's effectively O(k).
  • Empty-prefix guard — short-circuits before any lookup so an empty input cannot accidentally return the entire corpus.
  • Cost — build is O(sum of L_i²) character work producing sum(L_i) ≈ N · L_avg bucket entries (each term appears in L_i buckets); query is O(1) lookup + O(k) slice.



3. Hash Table Plus Two Pointers for Array Intersection in Python

Frequency dicts and two pointers for array intersection without set() in Python

"Find the intersection of two arrays without using set()" is a deliberate Lyft constraint that forces candidates off the auto-pilot one-liner and into the underlying primitive. The mental model: intersection = elements that appear in both arrays, with multiplicity respected. Two clean approaches: a frequency dictionary built on the smaller array (then scanned against the larger), or a two-pointer walk if both arrays are already sorted. The frequency-dict version is O(n + m); the two-pointer version is O(n + m) after a sort, with O(1) extra space.

Pro tip: "Without set()" usually means the interviewer wants to see you understand multi-set semantics. If a = [1, 1, 2] and b = [1, 2, 2], the answer is [1, 2], not [1, 1, 2, 2]—each match consumes one copy from each side. Decrement on match.

Frequency dict for the smaller array

The frequency-dict invariant: count occurrences in the smaller side, then walk the larger side and decrement. The smaller-side dict bounds memory; scanning the larger array is one pass. Use collections.Counter or a plain dict with .get(x, 0) + 1.

  • Counter(a) — one-line frequency dict; values are integer counts.
  • if counts[x] > 0 before appending—prevents over-counting once the dict's tally drops to zero.
  • counts[x] -= 1 on each match—decrement so duplicates don't all match the same single occurrence.
  • Smaller-side bias — count the shorter array to keep the dict small.

Worked example. a = [1, 2, 2, 3] (smaller), b = [2, 2, 4, 1] (larger). Build counts from a, then walk b.

| step | x | counts before | match? | counts after | result |
|------|---|---------------|--------|--------------|--------|
| 0 |  | {1:1, 2:2, 3:1} |  |  | [] |
| 1 | 2 | {1:1, 2:2, 3:1} | yes (2 > 0) | {1:1, 2:1, 3:1} | [2] |
| 2 | 2 | {1:1, 2:1, 3:1} | yes (1 > 0) | {1:1, 2:0, 3:1} | [2, 2] |
| 3 | 4 | {1:1, 2:0, 3:1} | no (get(4, 0) = 0) | unchanged | [2, 2] |
| 4 | 1 | {1:1, 2:0, 3:1} | yes (1 > 0) | {1:0, 2:0, 3:1} | [2, 2, 1] |

Worked-example solution.

from collections import Counter

def intersect_freq(a, b):
    if len(a) > len(b):
        a, b = b, a  # iterate the smaller
    counts = Counter(a)
    out = []
    for x in b:
        if counts.get(x, 0) > 0:
            out.append(x)
            counts[x] -= 1
    return out

intersect_freq([1, 2, 2, 3], [2, 2, 4, 1])  # [2, 2, 1]

Rule of thumb: Counter is the right primitive for multi-set intersection—set discards multiplicity by definition.

Two-pointer alternative when both arrays are sorted

When both inputs are already sorted (or you can afford to sort them), the two-pointer walk uses O(1) extra space. Maintain pointers i and j; at each step compare a[i] and b[j]. On equal, append and advance both. Otherwise, advance the side with the smaller value.

  • Sort first if not already sorted—O((n + m) log(n + m)); otherwise O(n + m).
  • if a[i] == b[j] — match; append once, advance both.
  • elif a[i] < b[j] — advance i.
  • else — advance j.
  • Multi-set semantics are automatic: advancing both on match consumes one copy from each side.

Worked example. a = [1, 2, 2, 3], b = [1, 2, 2, 4]—both sorted. Trace (i, j) after each step.

| step | i | j | a[i] | b[j] | action | result |
|------|---|---|------|------|--------|--------|
| 0 | 0 | 0 | 1 | 1 | match → i+=1, j+=1 | [1] |
| 1 | 1 | 1 | 2 | 2 | match → i+=1, j+=1 | [1, 2] |
| 2 | 2 | 2 | 2 | 2 | match → i+=1, j+=1 | [1, 2, 2] |
| 3 | 3 | 3 | 3 | 4 | a < b → i+=1 | [1, 2, 2] |
| 4 | 4 | 3 |  |  | i out of bounds → stop | [1, 2, 2] |

Worked-example solution.

def intersect_two_pointers(a, b):
    a = sorted(a)  # noop if already sorted; cheap to be safe
    b = sorted(b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

intersect_two_pointers([1, 2, 2, 3], [1, 2, 2, 4])  # [1, 2, 2]

Rule of thumb: if both inputs are guaranteed sorted, two-pointer beats the frequency dict on memory; if one side is much bigger and unsorted, the frequency-dict version is simpler.

Multi-set semantics: decrement on match to honor counts

The multi-set invariant: each element in a can match at most one element in b, and vice versa. Without decrementing, a = [1, 1] and b = [1, 1, 1] would emit [1, 1, 1]—wrong. Decrementing the dict (or advancing both pointers) caps the result at min(count_a, count_b) per value.

  • Frequency dict — decrement counts[x] on each emit.
  • Two-pointer — advance both i and j on match.
  • set intersection — collapses both sides to distinct values; loses multiplicity.
  • Test the boundary — if both arrays have wildly different counts of the same value, multi-set vs. set behavior differs visibly.

Worked example. a = [1, 1, 2], b = [1, 2, 2].

| approach | output |
|----------|--------|
| frequency dict (decrementing) | [1, 2] |
| two-pointer | [1, 2] |
| set(a) & set(b) (for contrast) | {1, 2} (no multiplicity) |
| naïve "if x in b" without decrement | [1, 1, 2] (wrong) |

Worked-example solution.

from collections import Counter

def intersect_multiset(a, b):
    ca, cb = Counter(a), Counter(b)
    out = []
    for x, n in ca.items():
        if x in cb:
            out.extend([x] * min(n, cb[x]))
    return out

intersect_multiset([1, 1, 2], [1, 2, 2])  # [1, 2]

Rule of thumb: if the prompt says "intersection," default to multi-set semantics; ask the interviewer to confirm only if the test cases hint at set semantics.

Common beginner mistakes

  • Using set(a) & set(b) and calling it done—loses multiplicity and ignores the "no set()" constraint.
  • Forgetting to decrement on match—each a element matches every duplicate in b.
  • Building the frequency dict on the larger array—wastes the bias; build the dict on the smaller side and scan the larger array against it.
  • Returning out.append(x) for every appearance in b regardless of counts[x]—over-emits when b has more copies than a.
  • Sorting copies in place with a.sort()—mutates the caller's input; use sorted(a) instead.

Python Interview Question on Array Intersection Without Set

Given two integer arrays a and b, return their intersection as a list, without using set or frozenset. Each value should appear in the result min(count in a, count in b) times. Order does not matter, but the multiplicity must be exact.

Solution Using a frequency dict and a single scan

from collections import Counter

def array_intersection(a, b):
    if len(a) > len(b):
        a, b = b, a  # bias: build the dict on the smaller array
    counts = Counter(a)
    out = []
    for x in b:
        if counts.get(x, 0) > 0:
            out.append(x)
            counts[x] -= 1
    return out

Step-by-step trace (input: a = [1, 2, 2, 3, 4], b = [2, 2, 4, 4, 5]):

| index | b[index] | counts before | matches? | counts after | output |
|-------|----------|---------------|----------|--------------|--------|
|  |  | {1:1, 2:2, 3:1, 4:1} |  |  | [] |
| 0 | 2 | {1:1, 2:2, 3:1, 4:1} | yes | {1:1, 2:1, 3:1, 4:1} | [2] |
| 1 | 2 | {1:1, 2:1, 3:1, 4:1} | yes | {1:1, 2:0, 3:1, 4:1} | [2, 2] |
| 2 | 4 | {1:1, 2:0, 3:1, 4:1} | yes | {1:1, 2:0, 3:1, 4:0} | [2, 2, 4] |
| 3 | 4 | {1:1, 2:0, 3:1, 4:0} | no (counts[4] = 0) | unchanged | [2, 2, 4] |
| 4 | 5 | {1:1, 2:0, 3:1, 4:0} | no (get(5, 0) = 0) | unchanged | [2, 2, 4] |
  1. Bias step — len(a) = 5, len(b) = 5; arrays are equal length so no swap. Counter built from a: {1:1, 2:2, 3:1, 4:1}.
  2. b[0] = 2 — counts[2] = 2 > 0, append 2, decrement to 1.
  3. b[1] = 2 — counts[2] = 1 > 0, append 2, decrement to 0.
  4. b[2] = 4 — counts[4] = 1 > 0, append 4, decrement to 0.
  5. b[3] = 4 — counts[4] = 0, skip; the second 4 in b doesn't match because a only has one 4.
  6. b[4] = 5 — counts.get(5, 0) = 0, skip.

Output:

| index | value |
|-------|-------|
| 0 | 2 |
| 1 | 2 |
| 2 | 4 |

Why this works — concept by concept:

  • Smaller-side dict — building the Counter on the shorter input bounds memory at O(min(n, m)); we still see every element of the larger side in one linear scan.
  • Counter.get(x, 0) — handles "value not in a" without KeyError; 0 correctly signals "no remaining copies to match."
  • Decrement-on-match — capping the contribution per value at counts[x] enforces multi-set semantics: each match consumes one copy from each side.
  • Single-pass over b — exactly one walk over the larger array; no nested scans, no O(n·m) pitfalls.
  • Order preservation — output reflects encounter order in b; cheap, deterministic, and avoids a sort step.
  • No set() dependency — the constraint is honored via the dict; the algorithm doesn't rely on hash-set membership semantics.
  • Cost — O(n + m) time, O(min(n, m)) extra space.



4. Binary Search for the Nth Missing Integer in Python

Binary search on the count-of-missing invariant in Python for data engineering

"Given a strictly increasing array of positive integers, find the Nth missing positive integer relative to a contiguous [1, 2, 3, …] baseline" looks like a counting puzzle until you spot the invariant: at index i, the number of missing positives below arr[i] is exactly arr[i] - (i + 1) (assuming 0-based indexing). That count is monotonically non-decreasing as i advances—so binary search on the predicate missing(i) >= n finds the boundary in O(log L) time over the array length.

Pro tip: This problem trips candidates who reach for a linear scan first—O(L)—or expand the array—O(arr[-1]). State the invariant out loud, derive the formula, then announce the binary-search reduction. Interviewers value the explicit naming of the invariant more than they value the optimal complexity.

Missing-count invariant: missing(i) = arr[i] - (i + 1)

The invariant is the heart of the problem. If the array were exactly [1, 2, 3, …], then arr[i] = i + 1, so arr[i] - (i + 1) = 0—zero missing values. Every gap in the array bumps arr[i] past i + 1, and the gap size accumulates: arr[i] - (i + 1) is the count of positive integers from 1 to arr[i] - 1 that are not in the array.

  • Baseline — for the contiguous array [1, 2, 3, ..., k], arr[i] - (i + 1) = 0 everywhere.
  • Single gap — drop 3 to get [1, 2, 4, 5]; at i = 2, arr[2] = 4, (i + 1) = 3, missing = 1.
  • Multiple gaps accumulate — at every index, the count is the total missing strictly below arr[i].
  • Strictly non-decreasing — because the array is strictly increasing, missing(i) never goes down as i advances.

Worked example. arr = [2, 3, 5, 9]. Compute missing(i) at every index.

| i | arr[i] | (i + 1) | missing(i) | what's missing below arr[i] |
|---|--------|---------|------------|------------------------------|
| 0 | 2 | 1 | 1 | {1} |
| 1 | 3 | 2 | 1 | {1} (4 not yet seen) |
| 2 | 5 | 3 | 2 | {1, 4} |
| 3 | 9 | 4 | 5 | {1, 4, 6, 7, 8} |

Worked-example solution.

def missing_at(arr, i):
    return arr[i] - (i + 1)

arr = [2, 3, 5, 9]
for i in range(len(arr)):
    print(f"i={i}, arr[i]={arr[i]}, missing={missing_at(arr, i)}")
# i=0, arr[i]=2, missing=1
# i=1, arr[i]=3, missing=1
# i=2, arr[i]=5, missing=2
# i=3, arr[i]=9, missing=5

Rule of thumb: when "Nth missing" appears with a sorted array, write arr[i] - (i + 1) on the whiteboard before writing any code.

Binary search on the predicate missing(mid) >= n

The predicate missing(i) >= n is monotone non-decreasing: once true, it stays true. That's the binary-search precondition. Search for the smallest index where the predicate first becomes true; the Nth missing value lives in the gap immediately before that index.

  • Lower-bound search — lo = 0, hi = len(arr) (one past the last index).
  • Loop condition — while lo < hi:.
  • Mid — mid = (lo + hi) // 2.
  • Branch — if missing(mid) < n, the boundary is to the right; lo = mid + 1. Else hi = mid.
  • Termination — lo == hi is the smallest index with missing(i) >= n.

Worked example. arr = [2, 3, 5, 9], n = 3. Trace the search.

| step | lo | hi | mid | arr[mid] | missing(mid) | branch | new lo, hi |
|------|----|----|-----|----------|--------------|--------|------------|
| 0 | 0 | 4 | 2 | 5 | 2 | 2 < 3 → right | lo=3, hi=4 |
| 1 | 3 | 4 | 3 | 9 | 5 | 5 >= 3 → left | lo=3, hi=3 |
| 2 | 3 | 3 |  |  |  | exit | lo = 3 |

The first index where missing(i) >= 3 is i = 3.

Worked-example solution.

def find_boundary(arr, n):
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] - (mid + 1) < n:
            lo = mid + 1
        else:
            hi = mid
    return lo

find_boundary([2, 3, 5, 9], 3)  # 3

Rule of thumb: lower-bound binary search on a monotone predicate is the workhorse pattern—reach for it whenever "find the first index where … is true."
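For reference, the same lower bound can be phrased with the standard library, assuming Python 3.10+ for bisect's key parameter; booleans sort False < True, so bisecting for True lands on the boundary:

from bisect import bisect_left

def find_boundary_bisect(arr, n):
    # key maps index i to the predicate missing(i) >= n
    return bisect_left(range(len(arr)), True,
                       key=lambda i: arr[i] - (i + 1) >= n)

print(find_boundary_bisect([2, 3, 5, 9], 3))  # 3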

Recover the answer: n + lo after the loop

Once the loop returns the boundary index lo, the Nth missing value falls in the gap just before arr[lo] (or after the array end if lo == len(arr)). The closed-form recovery is answer = n + lo: the n accounts for "Nth missing" and lo accounts for the count of elements still in the array below the answer.

  • Why n + lo — there are lo array elements at indices 0..lo-1; their values are at least 1, 2, …, lo. The Nth missing positive must come after them, offset by the missing count.
  • Edge: lo == 0 — no array elements below the answer; the answer is n directly.
  • Edge: lo == len(arr) — the answer is past the array end; n + len(arr) is correct because every array element is "below" the answer and consumes a slot.
  • Strict-increasing assumption — the formula breaks on duplicate or non-sorted inputs.

Worked example. Continuing the search above, lo = 3, n = 3, so answer = 3 + 3 = 6. Verify: missing values are {1, 4, 6, 7, 8}, sorted; the third is 6. ✓

| n | lo | n + lo | the missing values listed |
|---|----|--------|----------------------------|
| 1 | 0 | 1 | {1, 4, 6, 7, 8} → 1st is 1 |
| 2 | 2 | 4 | 2nd is 4 |
| 3 | 3 | 6 | 3rd is 6 |
| 4 | 3 | 7 | 4th is 7 |
| 5 | 3 | 8 | 5th is 8 |

Worked-example solution.

def nth_missing(arr, n):
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] - (mid + 1) < n:
            lo = mid + 1
        else:
            hi = mid
    return n + lo

print(nth_missing([2, 3, 5, 9], 1))  # 1
print(nth_missing([2, 3, 5, 9], 3))  # 6
print(nth_missing([2, 3, 5, 9], 5))  # 8

Rule of thumb: "n + lo" is one of those formulas that looks like magic until you've derived it once; do it on the whiteboard, then never forget it.

Common beginner mistakes

  • Doing a linear scan for the first i where missing(i) >= n—correct but O(L), not the asked O(log L).
  • Off-by-one on the indexing baseline—using arr[i] - i instead of arr[i] - (i + 1) because of 0-vs-1-based confusion.
  • Off-by-one on the answer recovery—returning lo or n + lo - 1 instead of n + lo.
  • Initializing hi = len(arr) - 1—blocks the loop from reaching the past-end case where the answer is beyond arr[-1].
  • Forgetting the strict-increasing assumption—the invariant fails silently on inputs with duplicates.

Python Interview Question on the Nth Missing Integer

Given a strictly increasing array of positive integers arr (each value ≥ 1) and a positive integer n, return the Nth positive integer that is missing from arr, comparing against the contiguous baseline [1, 2, 3, ...]. Solve in O(log L) time, where L = len(arr).

Solution Using lower-bound binary search on the missing-count invariant

def find_nth_missing(arr, n):
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        # missing positives below arr[mid] = arr[mid] - (mid + 1)
        if arr[mid] - (mid + 1) < n:
            lo = mid + 1
        else:
            hi = mid
    return n + lo

Step-by-step trace (input: arr = [2, 3, 5, 9, 12], n = 4):

| i | arr[i] | (i + 1) | missing(i) |
|---|--------|---------|------------|
| 0 | 2 | 1 | 1 |
| 1 | 3 | 2 | 1 |
| 2 | 5 | 3 | 2 |
| 3 | 9 | 4 | 5 |
| 4 | 12 | 5 | 7 |
  1. Initialize — lo = 0, hi = 5 (array length).
  2. Iteration 1 — mid = (0 + 5) // 2 = 2; arr[2] = 5; missing(2) = 5 - 3 = 2. 2 < 4 → predicate not yet true; move right: lo = 3.
  3. Iteration 2 — mid = (3 + 5) // 2 = 4; arr[4] = 12; missing(4) = 12 - 5 = 7. 7 >= 4 → predicate true; move left: hi = 4.
  4. Iteration 3 — mid = (3 + 4) // 2 = 3; arr[3] = 9; missing(3) = 9 - 4 = 5. 5 >= 4 → predicate true; move left: hi = 3.
  5. Loop exit — lo == hi == 3.
  6. Recover the answer — n + lo = 4 + 3 = 7. The 4th missing positive is 7.
  7. Spot-check — missing values below 12 are {1, 4, 6, 7, 8, 10, 11}; the 4th is 7. ✓

Output:

| n | nth_missing |
|---|-------------|
| 1 | 1 |
| 2 | 4 |
| 3 | 6 |
| 4 | 7 |
| 5 | 8 |

Why this works — concept by concept:

  • Missing-count invariant — arr[i] - (i + 1) exactly counts missing positives strictly below arr[i], derived from the gap between the actual value and its 1-based positional baseline.
  • Monotone predicate — because arr is strictly increasing, missing(i) never decreases; that monotonicity is the precondition for binary search.
  • Lower-bound search — while lo < hi with hi = len(arr) finds the smallest i where missing(i) >= n, the boundary the answer sits just below.
  • Inclusive past-end — initializing hi = len(arr) (not len(arr) - 1) lets the search converge cleanly when the answer is beyond arr[-1].
  • Closed-form recovery — n + lo works because there are lo array elements at positions 1..lo, and the Nth missing positive necessarily lives n slots beyond them.
  • Cost — O(log L) time (one binary search), O(1) space (no auxiliary structure).



5. SQL Time-Series Aggregation for New Customers Per Day

First-touch time-series aggregation in SQL for data engineering

"Count new customers per day" is the canonical first-touch time-series prompt for ride-share analytics. The mental model: a new customer on date D is a customer whose MIN(order_date) = D. The shape is two stacked aggregates: first compute each customer's first-touch date (one row per customer), then count customers per first-touch date (one row per day). The same shape powers daily-active-new-rider reports, weekly cohort sizes, and any first-event-per-entity rollup.

Pro tip: Pin the first-touch step in a CTE—do not nest it as a correlated subquery in the daily-count SELECT. The CTE form is faster (one scan instead of N), more readable, and matches what an interviewer expects to see when they ask you to talk through the plan.

First-touch per user: MIN(order_date) GROUP BY user_id

The first-touch invariant: MIN(order_date) GROUP BY user_id collapses each user to one row whose date is the user's first interaction. Repeated orders by the same user contribute zero to "new customers" because we keep only the minimum.

  • MIN(order_date) — earliest date per user.
  • GROUP BY user_id — one row per user.
  • Optional bound — WHERE order_date >= '2026-04-01' if the report is windowed.
  • Type discipline — MIN over DATE returns DATE; over TIMESTAMP returns TIMESTAMP. Cast or DATE_TRUNC('day', …) if the time component matters (see the sketch below).
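A sketch of that cast, assuming PostgreSQL syntax and a hypothetical order_ts TIMESTAMP column:

-- Collapse each timestamp to its calendar day before the per-user MIN.
SELECT
  user_id,
  MIN(order_ts::date) AS first_order_date
FROM orders
GROUP BY user_id;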

Worked example. Five orders across three customers.

| order_id | user_id | order_date |
|----------|---------|------------|
| 1 | 7 | 2026-04-01 |
| 2 | 8 | 2026-04-01 |
| 3 | 7 | 2026-04-02 |
| 4 | 9 | 2026-04-02 |
| 5 | 8 | 2026-04-03 |

After MIN(order_date) GROUP BY user_id:

| user_id | first_order_date |
|---------|------------------|
| 7 | 2026-04-01 |
| 8 | 2026-04-01 |
| 9 | 2026-04-02 |

Worked-example solution.

SELECT
  user_id,
  MIN(order_date) AS first_order_date
FROM orders
GROUP BY user_id;

Rule of thumb: "first time a thing happened per entity" → MIN(timestamp) GROUP BY entity_id.

Daily roll-up: COUNT(*) GROUP BY first_date

Once each customer has a first-touch date, the daily roll-up is one COUNT over the first-touch column. The result is one row per day with new_customers = COUNT(*). Dates with no new customers do not appear in this output—if you need a fully-densified daily series, left-join against a calendar table.

  • COUNT(*) — counts users (each row in the first-touch CTE is one distinct user).
  • GROUP BY first_order_date — bucket by day.
  • ORDER BY first_order_date — ascending date for time-series reading.
  • Sparse output — days without new customers are absent; densify with a generated date series + LEFT JOIN if needed.
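Here is that densification as a sketch, assuming PostgreSQL's generate_series and an illustrative fixed reporting window:

WITH first_touch AS (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
),
calendar AS (
  SELECT d::date AS order_day
  FROM generate_series(DATE '2026-04-01', DATE '2026-04-30',
                       INTERVAL '1 day') AS d
)
SELECT
  c.order_day,
  COUNT(ft.user_id) AS new_customers  -- COUNT(col) skips NULLs, giving 0 on empty days
FROM calendar c
LEFT JOIN first_touch ft
  ON ft.first_order_date = c.order_day
GROUP BY c.order_day
ORDER BY c.order_day;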

Worked example. Continuing the previous CTE.

| first_order_date | count | users |
|------------------|-------|-------|
| 2026-04-01 | 2 | users 7, 8 |
| 2026-04-02 | 1 | user 9 |

After COUNT(*) GROUP BY first_order_date:

| order_day | new_customers |
|-----------|---------------|
| 2026-04-01 | 2 |
| 2026-04-02 | 1 |

Worked-example solution.

WITH first_touch AS (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
)
SELECT
  first_order_date AS order_day,
  COUNT(*) AS new_customers
FROM first_touch
GROUP BY first_order_date
ORDER BY first_order_date;

Rule of thumb: once each user is collapsed to a single first-touch row, COUNT(*) GROUP BY date is the simplest possible roll-up—no DISTINCT needed.

Subquery / CTE composition: feed first-touch into a daily count

The composition invariant: the inner aggregate must be one row per user; the outer aggregate then counts those users per first-touch date. You can express the inner step as a CTE, a derived table, or (rarely) a correlated subquery—stick to CTE for readability and planner friendliness.

  • CTE form — WITH first_touch AS (...) SELECT ... FROM first_touch. Read top-down; one scan of orders.
  • Derived-table form — FROM (SELECT user_id, MIN(...) FROM orders GROUP BY user_id) ft. Equivalent plan, less readable.
  • Correlated subquery (avoid) — WHERE order_date = (SELECT MIN(order_date) FROM orders o2 WHERE o2.user_id = o.user_id). Slow on most engines unless the optimizer rewrites it.
  • Idempotency — running the same query twice on stable input yields the same answer; no order-dependent logic.

Worked example. Same orders table; trade the CTE for an equivalent derived table.

| variant | shape | readability |
|---------|-------|-------------|
| CTE | WITH first_touch AS (...) | high |
| derived table | FROM (...) ft | medium |
| correlated subquery | WHERE order_date = (SELECT MIN ...) | low + slow |

Worked-example solution.

-- Equivalent derived-table form
SELECT
  ft.first_order_date AS order_day,
  COUNT(*) AS new_customers
FROM (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
) ft
GROUP BY ft.first_order_date
ORDER BY ft.first_order_date;

Rule of thumb: every "two-stacked-aggregate" question is the same pattern—derive a per-entity summary, then aggregate the summary. Pin the inner step as a CTE.

Common beginner mistakes

  • Using COUNT(DISTINCT user_id) over the raw orders table grouped by order_date—that counts customers active that day, not new customers.
  • Using MIN(order_date) without GROUP BY user_id—returns one row, the global earliest date.
  • Forgetting ORDER BY first_order_date when the consumer expects a time-sorted report.
  • Densifying with a wrong calendar—leaving "0" days as missing or filling with NULL instead of 0.
  • Joining orders to itself to "find the earliest order"—when a CTE with MIN(order_date) is one line.

SQL Interview Question on New Customers Per Day

Table orders(order_id INT, user_id INT, order_date DATE). Return one row per day with the number of customers whose first order in the system fell on that day. Columns: order_day, new_customers. Sort ascending by order_day.

Solution Using a first-touch CTE and a daily count

WITH first_touch AS (
  SELECT user_id, MIN(order_date) AS first_order_date
  FROM orders
  GROUP BY user_id
)
SELECT
  first_order_date AS order_day,
  COUNT(*) AS new_customers
FROM first_touch
GROUP BY first_order_date
ORDER BY first_order_date;

Step-by-step trace (input: 7 rows, 4 customers across 4 days):

| order_id | user_id | order_date |
|----------|---------|------------|
| 1 | 1 | 2026-04-01 |
| 2 | 2 | 2026-04-01 |
| 3 | 1 | 2026-04-02 |
| 4 | 3 | 2026-04-02 |
| 5 | 2 | 2026-04-03 |
| 6 | 4 | 2026-04-04 |
| 7 | 1 | 2026-04-04 |
  1. Build first_touch — group by user, take MIN(order_date). User 1 → 2026-04-01; user 2 → 2026-04-01; user 3 → 2026-04-02; user 4 → 2026-04-04. Four rows total.
  2. Outer GROUP BY first_order_date — collapse the four CTE rows by their first-touch date.
  3. Outer COUNT(*) — count rows per first-touch date.
    • 2026-04-01 → users {1, 2} → 2.
    • 2026-04-02 → user {3} → 1.
    • 2026-04-04 → user {4} → 1.
    • 2026-04-03 has no first-touch date → row absent (sparse output).
  4. ORDER BY first_order_date — ascending date order.
  5. Final result — three rows; April 3 is correctly absent because no customer's first order fell on that day.

Output:

| order_day | new_customers |
|-----------|---------------|
| 2026-04-01 | 2 |
| 2026-04-02 | 1 |
| 2026-04-04 | 1 |

Why this works — concept by concept:

  • First-touch CTE — MIN(order_date) GROUP BY user_id collapses each user into one row whose date is their entry into the system; this is the canonical "new" definition.
  • Outer aggregate — once each user is represented exactly once with their first-touch date, COUNT(*) GROUP BY first_order_date correctly counts users per "new" day.
  • No DISTINCT needed — the CTE already deduplicated users, so COUNT(*) is exactly COUNT(DISTINCT user_id)—simpler and faster to plan.
  • Sparse-by-default — days without first-touches are absent from the output; this is correct unless the prompt explicitly asks for densified zeros via a calendar table.
  • ORDER BY first_order_date — gives the consumer a time-sorted report ready to chart or page through.
  • Cost — one full scan of orders for the CTE, one hash aggregate for the outer count → O(N) time and O(U) space where U is the number of distinct users.



6. SQL Window Functions for Consecutive-Day Buyer Cohorts

LAG-driven window functions for consecutive-day patterns in SQL for data engineering

"Find customers who bought on two consecutive days" is the canonical window-function prompt for retention. The mental model: for each customer, sort their distinct purchase dates; if any consecutive pair has a one-day gap, that customer qualifies. The structural primitive is LAG(purchase_date) OVER (PARTITION BY user_id ORDER BY purchase_date)—a per-user previous-purchase reference. The predicate is the date-arithmetic equality purchase_date = prev_date + INTERVAL '1 day'.

Diagram with a horizontal date axis showing two purchase markers per user, a LAG arrow pointing from Day N to Day N+1, and a purchase_date = prev_date + 1 predicate label.

Pro tip: Pre-deduplicate within a day. If a customer made three purchases on April 2, raw LAG over the partitioned window will compare two same-day rows and the consecutive-day predicate will misfire. SELECT DISTINCT user_id, purchase_date first, then apply LAG.

DISTINCT (user_id, purchase_date) to collapse intra-day duplicates

The dedup invariant: window functions operate on rows; multiple same-day rows distort the LAG comparison. Either project DISTINCT (user_id, purchase_date) in a CTE or GROUP BY (user_id, purchase_date) to one row per (user, day). The downstream window then sees exactly one row per user-day.

  • SELECT DISTINCT user_id, purchase_date — preserves both columns; no aggregates.
  • GROUP BY user_id, purchase_date — equivalent; some engines plan slightly differently.
  • Casting TIMESTAMP to DATE — purchase_ts::date collapses intra-day timestamps to days before the dedup.
  • One row per (user, day) — the contract that LAG-with-consecutive-day relies on.

Worked example. Three purchases for user 1: two on April 2, one on April 3. Without dedup, LAG sees two same-day rows.

| user_id | purchase_date (raw) |
|---------|---------------------|
| 1 | 2026-04-02 |
| 1 | 2026-04-02 |
| 1 | 2026-04-03 |

After DISTINCT:

| user_id | purchase_date |
|---------|---------------|
| 1 | 2026-04-02 |
| 1 | 2026-04-03 |

Worked-example solution.

WITH user_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
)
SELECT * FROM user_days ORDER BY user_id, purchase_date;

Rule of thumb: whenever a date column might have multiple rows per (entity, day), dedup first and apply window functions on the deduped CTE.

LAG(purchase_date) OVER (PARTITION BY user_id ORDER BY purchase_date)

The window-function invariant: LAG(col) returns the value of col from the previous row within the partition, ordered by the same key. For the consecutive-day check, partition by user (so the lookup never crosses customers) and order by date (so "previous" means "previous purchase chronologically").

  • PARTITION BY user_id — each user gets their own ordered window.
  • ORDER BY purchase_date — defines what "previous" means.
  • LAG(purchase_date) — returns NULL for the first row in each partition (no prior row).
  • LAG(col, n) — n-row lag; default n = 1.

Worked example. User 1 has purchases on April 1, 2, 5; user 2 has purchases on April 1, 4. After window:

| user_id | purchase_date | prev_date |
|---------|---------------|-----------|
| 1 | 2026-04-01 | NULL |
| 1 | 2026-04-02 | 2026-04-01 |
| 1 | 2026-04-05 | 2026-04-02 |
| 2 | 2026-04-01 | NULL |
| 2 | 2026-04-04 | 2026-04-01 |

Worked-example solution.

WITH user_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
),
with_prev AS (
  SELECT
    user_id,
    purchase_date,
    LAG(purchase_date) OVER (
      PARTITION BY user_id
      ORDER BY purchase_date
    ) AS prev_date
  FROM user_days
)
SELECT * FROM with_prev ORDER BY user_id, purchase_date;

Rule of thumb: LAG and LEAD are the right primitives for "previous/next event per entity in time order"—not self-joins.

Consecutive predicate: purchase_date = prev_date + INTERVAL '1 day'

The consecutive invariant: two purchases are on consecutive days iff purchase_date - prev_date = 1 day. In SQL flavors with date arithmetic, write prev_date + INTERVAL '1 day'; in pure DATE arithmetic, prev_date + 1 works on PostgreSQL. The predicate goes into a WHERE filter on the windowed CTE.

  • WHERE purchase_date = prev_date + INTERVAL '1 day' — strict equality on +1 day.
  • SELECT DISTINCT user_id — collapse multiple consecutive-day pairs per user to one row in the answer.
  • NULL safety — prev_date IS NULL rows (first purchase per user) are filtered out automatically because NULL + 1 = NULL and purchase_date = NULL is never true.
  • Generalize to N consecutive days — chain N-1 LAG calls, or use ROW_NUMBER minus purchase_date (gaps-and-islands).
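A gaps-and-islands sketch of that generalization, assuming PostgreSQL date arithmetic and N = 3: dates inside one consecutive run share the same date-minus-row-number anchor, so grouping on the anchor measures run lengths.

WITH user_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
),
islands AS (
  SELECT
    user_id,
    -- consecutive dates share one anchor; any gap shifts the anchor
    purchase_date - (ROW_NUMBER() OVER (
      PARTITION BY user_id ORDER BY purchase_date
    ))::int AS island_anchor
  FROM user_days
)
SELECT DISTINCT user_id
FROM islands
GROUP BY user_id, island_anchor
HAVING COUNT(*) >= 3
ORDER BY user_id;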

Worked example. Continuing the windowed table above, apply the predicate.

| user_id | purchase_date | prev_date | consecutive? |
|---------|---------------|-----------|--------------|
| 1 | 2026-04-01 | NULL | no |
| 1 | 2026-04-02 | 2026-04-01 | yes |
| 1 | 2026-04-05 | 2026-04-02 | no (gap = 3) |
| 2 | 2026-04-01 | NULL | no |
| 2 | 2026-04-04 | 2026-04-01 | no (gap = 3) |

User 1 qualifies; user 2 does not.

Worked-example solution.

WITH user_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
),
with_prev AS (
  SELECT
    user_id,
    purchase_date,
    LAG(purchase_date) OVER (
      PARTITION BY user_id
      ORDER BY purchase_date
    ) AS prev_date
  FROM user_days
)
SELECT DISTINCT user_id
FROM with_prev
WHERE purchase_date = prev_date + INTERVAL '1 day'
ORDER BY user_id;

Rule of thumb: always pair LAG with DISTINCT on the dedup CTE and DISTINCT user_id in the final projection—otherwise a customer with three consecutive days appears twice.

Common beginner mistakes

  • Skipping the DISTINCT (user_id, purchase_date) step—same-day duplicates make LAG compare same-day rows and the predicate misfires.
  • Using INTERVAL '1 day' in dialects that don't accept it (e.g., MySQL needs DATE_ADD(prev_date, INTERVAL 1 DAY))—engine-specific syntax matters; see the dialect sketch after this list.
  • Omitting PARTITION BY user_id — LAG then crosses users and produces meaningless "previous-customer-on-different-day" comparisons.
  • Forgetting DISTINCT user_id in the final projection—a user with multiple consecutive-day pairs is counted multiple times.
  • Using purchase_date - prev_date <= 1—matches same-day duplicates if they survived the dedup; use strict = 1.
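A dialect sketch of the same query for MySQL 8+ (which has CTEs and window functions but rejects the PostgreSQL-style INTERVAL '1 day' literal):

WITH user_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
),
with_prev AS (
  SELECT
    user_id,
    purchase_date,
    LAG(purchase_date) OVER (
      PARTITION BY user_id
      ORDER BY purchase_date
    ) AS prev_date
  FROM user_days
)
SELECT DISTINCT user_id
FROM with_prev
WHERE purchase_date = DATE_ADD(prev_date, INTERVAL 1 DAY)  -- MySQL date addition
ORDER BY user_id;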

SQL Interview Question on Customers Who Bought on 2 Consecutive Days

Table purchases(user_id INT, purchase_date DATE) with possibly multiple rows per (user, date). Return the distinct user_ids of customers who made purchases on at least one pair of consecutive calendar days. Sort ascending by user_id.

Solution Using LAG with a per-user partition

WITH user_days AS (
  SELECT DISTINCT user_id, purchase_date
  FROM purchases
),
with_prev AS (
  SELECT
    user_id,
    purchase_date,
    LAG(purchase_date) OVER (
      PARTITION BY user_id
      ORDER BY purchase_date
    ) AS prev_date
  FROM user_days
)
SELECT DISTINCT user_id
FROM with_prev
WHERE purchase_date = prev_date + INTERVAL '1 day'
ORDER BY user_id;

Step-by-step trace (input: 9 purchases across 3 customers):

| user_id | purchase_date |
|---------|---------------|
| 1 | 2026-04-01 |
| 1 | 2026-04-02 |
| 1 | 2026-04-02 |
| 1 | 2026-04-05 |
| 2 | 2026-04-01 |
| 2 | 2026-04-04 |
| 3 | 2026-04-10 |
| 3 | 2026-04-11 |
| 3 | 2026-04-12 |
  1. Build user_days — DISTINCT (user_id, purchase_date) collapses the duplicate (1, 2026-04-02) to one row. Result has 8 rows.
  2. Build with_prev — for each row, compute LAG(purchase_date) OVER (PARTITION BY user_id ORDER BY purchase_date).

| user_id | purchase_date | prev_date |
|--------:|---------------|-----------|
| 1 | 2026-04-01 | NULL |
| 1 | 2026-04-02 | 2026-04-01 |
| 1 | 2026-04-05 | 2026-04-02 |
| 2 | 2026-04-01 | NULL |
| 2 | 2026-04-04 | 2026-04-01 |
| 3 | 2026-04-10 | NULL |
| 3 | 2026-04-11 | 2026-04-10 |
| 3 | 2026-04-12 | 2026-04-11 |

  3. Apply the consecutive predicate — purchase_date = prev_date + INTERVAL '1 day':
    • User 1, 2026-04-02 vs. 2026-04-01 → match.
    • User 1, 2026-04-05 vs. 2026-04-02 → no (gap = 3).
    • User 2, 2026-04-04 vs. 2026-04-01 → no (gap = 3).
    • User 3, 2026-04-11 vs. 2026-04-10 → match.
    • User 3, 2026-04-12 vs. 2026-04-11 → match (second consecutive pair).
  4. SELECT DISTINCT user_id — collapse the three matches across users 1 and 3 to two rows.
  5. ORDER BY user_id — final ascending result.

Output:

| user_id |
|---------|
| 1 |
| 3 |

Why this works — concept by concept:

  • DISTINCT (user_id, purchase_date) dedup — guarantees one row per user-day, so the LAG comparison is between actual prior-day purchases, not intra-day duplicates that would corrupt the predicate.
  • Partitioned window — PARTITION BY user_id ORDER BY purchase_date gives each customer an isolated, time-ordered view; LAG never crosses users.
  • LAG(purchase_date) — returns the immediately previous purchase date within the partition; NULL on the first row per user, automatically handling "no prior purchase" without a special case.
  • Strict-equality predicate — purchase_date = prev_date + INTERVAL '1 day' matches exactly the consecutive-day case; NULL + 1 = NULL, so first-row NULLs are silently dropped.
  • SELECT DISTINCT user_id — collapses multiple qualifying pairs per user (e.g., a 3-day streak produces two pairs) into one answer row.
  • Generalizes naturally — for "N consecutive days," chain N-1 LAG calls or switch to the gaps-and-islands ROW_NUMBER() - purchase_date trick.
  • Cost — one scan of purchases for the dedup CTE, one window pass over the deduped rows, one final scan with the predicate filter → O(N log N) if the engine sorts inside the window, otherwise O(N) with a hash-partition operator.


Tips to Crack Lyft Data Engineering Interviews

Drill the four Lyft Python primitives, in order

The four Python primitives in the curated Lyft Python practice set are heap-backed multi-stream merge, prefix-bucket hash autocomplete, Counter-style array intersection, and binary-search on the Nth-missing-integer invariant. Each maps to a specific module: heapq for #14, collections.defaultdict for #15, collections.Counter for #22, vanilla list-and-arithmetic for #125. Learn the import lines as muscle memory—import heapq, from collections import defaultdict, Counter—and the rest of the prompt becomes pattern recognition.
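As a warm-up, here is a minimal sketch exercising all four primitives in one file; the stream contents, place names, and arrays are invented for illustration:

```python
import heapq
from collections import defaultdict, Counter

# 1. Heap-backed k-way merge over sorted iterators (heapq.merge wraps
#    the push-pop min-heap loop from the multi-stream section).
merged = list(heapq.merge([1, 4, 7], [2, 5], [3, 6]))
assert merged == [1, 2, 3, 4, 5, 6, 7]

# 2. Prefix buckets: every prefix of a place name maps to its matches.
buckets = defaultdict(list)
for place in ["airport", "airfield"]:
    for i in range(1, len(place) + 1):
        buckets[place[:i]].append(place)
assert buckets["air"] == ["airport", "airfield"]

# 3. Multi-set intersection via Counter, no set() involved.
common = list((Counter([1, 2, 2, 3]) & Counter([2, 2, 4])).elements())
assert common == [2, 2]

# 4. Count-of-missing invariant: arr[i] - (i + 1) positives are missing
#    before arr[i] -- a monotone quantity ready for binary search.
arr = [2, 3, 7, 11]
missing_before = [arr[i] - (i + 1) for i in range(len(arr))]
assert missing_before == [1, 1, 4, 7]
```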

Treat SQL as a two-step composition

Both Lyft SQL prompts are two stacked aggregates: §5 is "first-touch per user, then count per first-touch date"; §6 is "dedup per user-day, then LAG over a partitioned window, then filter." When you see a Lyft SQL prompt, ask yourself: what is the per-entity intermediate, and what is the final aggregate over that intermediate? Pin the intermediate in a CTE, name it after what it is (first_touch, user_days, with_prev), and write the outer query as if the CTE were a real table. The Lyft SQL practice page drills exactly this pattern.
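A hedged skeleton of that composition for the §5 prompt, assuming the orders(user_id, order_date) table from the problem statement; the CTE and output column names are illustrative:

```sql
-- Step 1: per-entity intermediate -- each user's first order date.
WITH first_touch AS (
  SELECT user_id, MIN(order_date) AS first_date
  FROM orders
  GROUP BY user_id
)
-- Step 2: final aggregate over the intermediate -- new customers per day.
SELECT first_date, COUNT(*) AS new_customers
FROM first_touch
GROUP BY first_date
ORDER BY first_date;
```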

Easy-tier discipline matters more than medium-tier intuition

Three of the six Lyft problems are EASY-tier—#14, #15, #22. Easy at Lyft does not mean trivial; it means the interviewer expects zero hesitation, no off-by-ones, clean variable names, and an articulated invariant. An EASY answer delivered with hesitation and a lingering bug is graded worse than a MEDIUM answer with the same bug, because the EASY signal is supposed to be "this candidate writes Python fluently." The Lyft easy-tier practice set is the right warmup.

Mobility-data framing on every prompt

Frame your verbal walkthrough in ride-share vocabulary: rider events, driver streams, place autocomplete, route segments, trip-id arrays, daily new riders, return-day cohorts. Interviewers test whether you can translate the abstract algorithm to the business question—matching the interviewer's mental model of what their team builds. The Lyft medium practice set is mostly framed in mobility analytics terms; mirror that vocabulary in your answer.

State the invariant out loud, then write code

For binary-search problems (§4) and window-function problems (§6), the invariant is the answer: arr[i] - (i + 1) = missing(i) for §4, purchase_date = prev_date + INTERVAL '1 day' for §6. Write the invariant on the whiteboard before any code. Interviewers grade the explicit naming of the invariant more than the lines of code that implement it; getting to the answer fast without naming the invariant suggests memorization rather than understanding.
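A minimal sketch of the §4 invariant driving a lower-bound binary search; the function and variable names are mine and the toy array is illustrative:

```python
def nth_missing(arr, n):
    """Return the nth missing positive integer, given a sorted,
    strictly increasing array of positive integers."""
    lo, hi = 0, len(arr)
    # Invariant: arr[i] - (i + 1) = count of positives missing before arr[i].
    # Find the first index whose missing-count reaches n (lower bound).
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] - (mid + 1) < n:
            lo = mid + 1
        else:
            hi = mid
    # lo array elements precede the answer, so shift n past them.
    return n + lo

assert nth_missing([2, 3, 4, 7, 11], 5) == 9  # missing: 1, 5, 6, 8, 9
```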

Where to practice on PipeCode

Start with the Lyft practice page for the curated 6-problem set. Then drill the two surface-area pages: Lyft Python practice for #14 / #15 / #22 / #125, Lyft SQL practice for #201 / #202. For broad topic coverage, browse by topic and pick the matching primitives—queue, hash table, two pointers, binary search, window functions, date arithmetic. The interview courses page bundles the SQL and Python courses if you want a structured path before the company drills.

Communication and approach under time pressure

Talk through the invariant first, the brute force second, and the optimal third—even if you go straight to the optimal in code. Interviewers grade process as much as the final answer. Leave 5 minutes at the end of each problem for an edge-case sweep: empty inputs, single-element arrays, all-duplicates, all-distinct, NULL-prev-date in window functions, exhausted streams in heap merges. The most common "almost passed" failure mode is correct happy-path code that crashes on the empty input—a 30-second sweep prevents it.
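For the heap-merge primitive, a 30-second sweep might look like the sketch below, using the standard-library heapq.merge as a stand-in for your own k-way merge:

```python
import heapq

# Edge-case sweep for a k-way merge: no streams, all-empty streams,
# single element, all duplicates, and one exhausted stream among live ones.
assert list(heapq.merge()) == []
assert list(heapq.merge([], [])) == []
assert list(heapq.merge([1])) == [1]
assert list(heapq.merge([2, 2], [2])) == [2, 2, 2]
assert list(heapq.merge([], [1, 3], [2])) == [1, 2, 3]
```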


Frequently Asked Questions

What is the Lyft data engineering interview process like?

Lyft's data engineering interview typically opens with a recruiter screen, then a technical phone screen with one Python or SQL coding problem, then an onsite (or virtual onsite) of four to five rounds: two coding rounds focused on Python data structures and SQL analytics, one data-modeling or system-design discussion (rider/driver pipelines, trip aggregates, real-time event handling), and one or two behavioral rounds. The coding rounds are heavy on the patterns covered in this guide—heap-backed queues, hash-table lookups, two-pointer intersection, binary-search invariants, time-series aggregations, and window-function retention queries.

What Python topics does Lyft test for data engineers?

Lyft Python interviews concentrate on four primitives: heapq for k-way merge and priority queues (#14 Multi Stream Reader), collections.defaultdict for prefix-keyed hash buckets (#15 Text Search Auto-Complete), collections.Counter and frequency dicts for multi-set semantics (#22 Array Intersection Without Set), and lower-bound binary search over a monotone count-of-missing invariant (#125 Nth Missing Integer). All four problems are framed in mobility-data vocabulary—event streams, place autocomplete, route segments, sparse trip-id arrays. Practice the imports as muscle memory: import heapq and from collections import defaultdict, Counter.

How important is SQL for a Lyft data engineering interview?

SQL is one third of Lyft's coding signal—two of the six curated problems are SQL. Both are MEDIUM difficulty and both reward time-series thinking: #201 New Customers Per Day tests first-touch aggregation (MIN(order_date) GROUP BY user_id then COUNT GROUP BY first_date), and #202 Customers Who Bought on 2 Consecutive Days tests LAG over a user-partitioned window plus the consecutive-day predicate purchase_date = prev_date + INTERVAL '1 day'. Expect at least one SQL round where you have to write window-function or two-stage-aggregate SQL on the spot.

How hard are Lyft data engineering interview questions?

The curated Lyft set is 3 easy + 3 medium + 0 hard—a fundamentals-first profile. The hardest of the medium tier is #125 Nth Missing Integer, where the count-of-missing invariant arr[i] - (i + 1) is the crux; without it, candidates default to linear scans or materializing the expanded array instead of an O(log n) binary search. The two SQL mediums (#201 and #202) are harder than they look on the surface: the DISTINCT (user_id, purchase_date) dedup before LAG and the MIN(order_date) GROUP BY user_id first-touch CTE are both invariants that interviewers expect you to articulate, not just code.

Are window functions and consecutive-day patterns common in Lyft interviews?

Yes—#202 Customers Who Bought on 2 Consecutive Days is the explicit test of LAG, partition-by, and date-arithmetic predicates. Lyft's analytics surface (driver retention, rider return frequency, two-trip-day cohorts) leans on this pattern in production, so the interview prompt mirrors real on-the-job SQL. Drill LAG / LEAD over partitioned windows, the gaps-and-islands ROW_NUMBER() - date trick for N-consecutive-day queries, and the strict-equality date predicate. The general window-functions practice page and the date-arithmetic practice page cover both.

How many Lyft practice problems should I solve before the interview?

Solve all 6 problems on the Lyft practice page end-to-end—untimed first, then timed at 25 minutes per problem. After that, broaden to 30 to 40 additional problems spread across the matching topic pages: queue and design, hash table, two pointers, binary search, time series, window functions, date arithmetic. The Lyft Python practice page and Lyft SQL practice page are the right surfaces for the final week of prep before the loop.


Start practicing Lyft data engineering problems
