DEV Community

Cover image for PayPal Data Engineering Interview Questions
Gowtham Potureddi


PayPal Data Engineering Interview Questions

PayPal data engineering interview questions are Python end-to-end with a payments-analytics edge: five Python primitives (list comprehensions for filter-then-map array transformations, operator semantics across int / str / list / tuple for type-operation outputs, defaultdict(int) aggregation for per-key sales rollups, set algebra (&, |, -) for recommendation engines that surface shared favorites, and BFS-based 2-coloring on adjacency dicts for bipartite seating-arrangement validation). The framings are merchant and platform data—catering events, restaurant recommendations, seat assignments, and the everyday transformations a data engineer applies to per-row payment and merchant feeds.

This guide walks through the five topic clusters PayPal actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 5-problem PayPal set (3 easy, 2 medium, no hard)—a Python-fundamentals hub that rewards comprehension fluency, type-quirk awareness, and clean graph reasoning over algorithm-puzzle prep. There is no SQL in this loop; every minute of preparation should go to Python.



Top PayPal data engineering interview topics

From the PayPal data engineering practice set, the five numbered sections below follow this topic map (one row per H2):

# Topic (sections 1–5) Why it shows up at PayPal
1 Python list comprehensions for array transformations List Comprehension Conversion—rewrite a for / append / if loop as [expr for x in iter if cond] in one line.
2 Python type operations and operator semantics Python Type Operations Output—predict + * == is results across int / str / list / tuple, including small-int caching and sequence concat.
3 Hash tables and defaultdict aggregation for sales reports Catering Sales Report—defaultdict(int) per-event totals, then sorted(..., key=lambda kv: -kv[1]) for ranked output.
4 Set intersection for recommendation engines Restaurant Recommendations—per-user set of liked places, intersect to surface shared favorites, difference to filter "new only."
5 Graphs and sets for bipartite seating validation Valid Seating Arrangement—adjacency dict from a list of conflicts, BFS 2-coloring, same-color edge means not bipartite.

Payments-and-platform framing rule: PayPal's prompts span everyday data-engineering Python on payments and merchant data—filter-map array transforms, type-aware parsing of mixed inputs, per-key aggregation for revenue rollups, set-based recommendations, and graph integrity checks. The interviewer is grading whether you map each business framing to the right Python primitive: filter-then-map → list comprehension; type quirks → operator-semantic awareness; per-key totals → defaultdict(int); "users who like both" → set intersection; "valid pairing on a conflict graph" → BFS bipartite check. State the mapping out loud.


1. Python List Comprehensions for Array Transformations

List comprehensions for filter-then-map array transforms in Python for data engineering

"Convert this for / append / if loop into a list comprehension" is the canonical Python-fluency interview prompt at PayPal. The mental model: a comprehension [expr for x in iter if cond] is a one-line filter-then-map pipeline. Every imperative for loop that builds a list with append (with optional if filtering) collapses into one comprehension. PayPal grades whether you can rewrite the imperative loop without changing semantics—same input, same output, same edge cases.

Pro tip: A comprehension and an append loop have the same time complexity but the comprehension is meaningfully faster (lower per-iteration overhead) and is what PayPal's interviewers expect to see when they ask "can you make this Pythonic?" Don't reach for filter() + map() + list()—that is a 3-call chain that most reviewers will rewrite as a comprehension anyway.

Comprehension syntax: [expr for x in iter if cond]

The comprehension invariant: for each x in iter, if cond is truthy, evaluate expr and collect the result. Reading order: expr, then for x in iter, then if cond. Execution order: scan iter, filter by cond, transform with expr.

  • [expr for x in iter] — pure map; one output per input.
  • [expr for x in iter if cond] — filter-then-map; outputs only where cond is truthy.
  • {expr for x in iter} — set comprehension; deduplicates results.
  • {k: v for k, v in iter} — dict comprehension.
  • (expr for x in iter) — generator expression; lazy, memory-friendly for large inputs.
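The five forms above, run side by side on one small input (the values here are chosen so the set form actually deduplicates):

```python
nums = [1, -1, 2, -2]

mapped   = [x * x for x in nums]           # pure map: [1, 1, 4, 4]
filtered = [x * x for x in nums if x > 0]  # filter-then-map: [1, 4]
unique   = {x * x for x in nums}           # set comprehension dedups: {1, 4}
by_input = {x: x * x for x in nums}        # dict comprehension: {1: 1, -1: 1, 2: 4, -2: 4}
lazy     = (x * x for x in nums)           # generator: nothing computed yet

print(sum(lazy))                           # 10, consumed one value at a time
```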

Worked example. Convert this loop into a comprehension.

result = []
for x in [1, -2, 3, -4, 5]:
    if x > 0:
        result.append(x * x)
# result = [1, 9, 25]

Worked-example solution.

result = [x * x for x in [1, -2, 3, -4, 5] if x > 0]
# [1, 9, 25]

Rule of thumb: if the body of the loop is one if followed by one append, it is a comprehension. If the body has any other statement (a print, a counter, an early break), keep the loop.

Filter-map fusion vs separate filter + map calls

The fusion invariant: a comprehension does the filter and the map in one pass. Equivalent filter(...) and map(...) calls produce the same result but read worse and force an extra list(...) materialization in Python 3 (where both return iterators).

  • [expr for x in iter if cond] — one pass, one allocation, idiomatic.
  • list(map(lambda x: expr, filter(lambda x: cond, iter))) — three function calls; harder to read.
  • Generator expression — sum(x*x for x in nums if x > 0) skips the intermediate list entirely.
  • itertools chains — for streaming pipelines without materializing.
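A minimal sketch of the streaming case in the last bullet (the function name is illustrative): generator stages chain together, and nothing is computed until the caller pulls values.

```python
from itertools import islice

def squares_of_positives(numbers):
    # Each stage is a lazy generator; no intermediate list is built.
    positives = (x for x in numbers if x > 0)   # filter stage
    return (x * x for x in positives)           # map stage

# Pull only the first three results from a 2000-element range
first_three = list(islice(squares_of_positives(range(-1000, 1000)), 3))
print(first_three)                              # [1, 4, 9]
```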

Worked example. Three equivalent ways to compute the squares of positive numbers.

approach code output
comprehension [x*x for x in nums if x > 0] [1, 9, 25]
filter + map list(map(lambda x: x*x, filter(lambda x: x>0, nums))) [1, 9, 25]
generator (sum) sum(x*x for x in nums if x > 0) 35

Worked-example solution.

nums = [1, -2, 3, -4, 5]

# Pythonic
squares_pos = [x * x for x in nums if x > 0]

# Equivalent but verbose
squares_pos_alt = list(map(lambda x: x * x, filter(lambda x: x > 0, nums)))

assert squares_pos == squares_pos_alt == [1, 9, 25]

Rule of thumb: prefer comprehensions; reach for filter / map only when you already have named functions to plug in (e.g. list(map(int, raw_strs))).

Nested comprehensions and conditional expressions inside expr

The composition invariant: two transformations can live inside the same comprehension—the expr slot can be a conditional expression, and a second for clause introduces nesting. Used carefully, this expresses cross products and label assignments in one line; used carelessly, it produces unreadable code.

  • Conditional expression in expr — [x if x >= 0 else -x for x in nums] (clamp to absolute value).
  • Two for clauses — [(a, b) for a in xs for b in ys] (Cartesian product; outer-loop-first reading order).
  • Combined filter and conditional expression — [("pos" if x > 0 else "zero") for x in nums if x >= 0].
  • Stop at one nest level — beyond two for clauses, switch to a regular loop or itertools.product.
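A short demo of the first two bullets: a conditional expression in the expr slot, and a two-for Cartesian product (the variable names are illustrative):

```python
# Conditional expression in expr: label every element, drop nothing
nums = [3, 0, -2]
labels = [("pos" if x > 0 else "non-pos") for x in nums]
print(labels)   # ['pos', 'non-pos', 'non-pos']

# Two for clauses: outer loop first, so sizes vary slowest
sizes, colors = ["S", "M"], ["red", "blue"]
pairs = [(s, c) for s in sizes for c in colors]
print(pairs)    # [('S', 'red'), ('S', 'blue'), ('M', 'red'), ('M', 'blue')]
```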

Worked example. Replace negatives with their absolute value, then keep only values whose absolute is ≤ 4.

nums = [-5, -3, 0, 2, 4, 7]
# Goal: [3, 0, 2, 4]

Worked-example solution.

result = [
    abs(x)
    for x in [-5, -3, 0, 2, 4, 7]
    if abs(x) <= 4
]
# [3, 0, 2, 4]

Rule of thumb: the if cond and the expr slots can each call any function—including abs, int, str.lower—so most "transform-then-filter-then-transform" pipelines are still one comprehension.

Common beginner mistakes

  • Using filter() / map() then forgetting list(...) in Python 3—returns an iterator, not a list.
  • Putting the if before the for—syntax error in a comprehension (the if is a filter clause, not a top-level conditional).
  • Mutating the iterable inside the comprehension—easily produces skipped or repeated elements; build into a fresh list instead.
  • Writing a 3-level nested comprehension—correct but unreadable; use a regular loop.
  • Forgetting that the leftmost for is the outermost loop in nested comprehensions—reverse mental model and your output is transposed.
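The last mistake above, misreading nested clause order, is easiest to internalize by writing the equivalent loop: the for clauses read top to bottom in the same order, leftmost outermost.

```python
grid = [[1, 2], [3, 4], [5, 6]]

# Leftmost for = outermost loop: iterate rows, then cells within each row
flat = [cell for row in grid for cell in row]

# The same clause order as an explicit loop
flat_loop = []
for row in grid:
    for cell in row:
        flat_loop.append(cell)

assert flat == flat_loop == [1, 2, 3, 4, 5, 6]
```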

Python Interview Question on List Comprehension Conversion

Convert the loop below into a single list comprehension that produces the same output.

def positive_doubles(nums):
    out = []
    for n in nums:
        if n > 0:
            out.append(n * 2)
    return out

The function takes a list of integers and returns a list of doubled values for positive entries only. Preserve the input order.

Solution Using a single [expr for x in iter if cond] comprehension

def positive_doubles(nums):
    return [n * 2 for n in nums if n > 0]

Step-by-step trace (input: nums = [3, -1, 4, 0, -7, 2]):

index n n > 0? n * 2 output so far
0 3 yes 6 [6]
1 -1 no [6]
2 4 yes 8 [6, 8]
3 0 no [6, 8]
4 -7 no [6, 8]
5 2 yes 4 [6, 8, 4]
  1. Iterate nums in order; the comprehension is left-to-right.
  2. Filter if n > 0 — 0 and the negatives drop; 3, 4, 2 pass.
  3. Map n * 2 — emit 6, 8, 4 in that order.
  4. Collect into a list — return [6, 8, 4].

Output:

return value
[6, 8, 4]

Why this works — concept by concept:

  • for n in nums — iterates the input in order; comprehensions preserve insertion order, so the output is order-stable.
  • if n > 0 filter clause — drops every value that doesn't satisfy the predicate before the expression runs; 0 and negatives never reach n * 2.
  • n * 2 expression — the only transformation applied to surviving values; lives in the expr slot of the comprehension.
  • One-pass evaluation — the comprehension iterates once, filters, and emits; no intermediate list of "all values" is built.
  • List allocation — the result list grows with amortized over-allocation; the comprehension still beats append in a loop because it skips the per-iteration attribute lookup and method call.
  • Cost — O(N) time over the input length; O(K) space where K is the number of positives.

PYTHON
Topic — list comprehension
List-comprehension problems

Practice →

PYTHON
PayPal — array
PayPal array problems

Practice →


2. Python Type Operations and Operator Semantics

Operator semantics across Python types for data engineering

"Predict the output of these operator expressions" is PayPal's signature Python-fluency screen. The mental model: +, *, ==, and is mean different things depending on operand types. + is numeric add for ints, sequence concat for strings/lists/tuples. * is numeric multiply for ints, sequence-replicate when one side is an int. == is value equality for everything. is is identity—only true when both names point at the same object in memory. Knowing how each operator dispatches is the difference between confidently predicting [1, 2] * 3 == [1, 2, 1, 2, 1, 2] and guessing.

Pro tip: Python caches small ints (typically -5 to 256) so a is b returns True for a = 256; b = 256 but False for a = 257; b = 257. Never use is for value comparison—use ==. Use is only for is None, is True, is False, and identity checks against sentinel objects.

+ overloading: numeric add vs sequence concat

The + invariant: for numerics it adds; for sequences it concatenates; mixed types raise TypeError. Dispatch tries the left operand's __add__ first, then falls back to the right operand's __radd__.

  • int + int — 5 + 3 == 8, numeric.
  • str + str — "ab" + "cd" == "abcd", concat.
  • list + list — [1, 2] + [3] == [1, 2, 3], concat.
  • tuple + tuple — (1, 2) + (3,) == (1, 2, 3), concat (note the trailing comma for a one-tuple).
  • int + str — TypeError; mixed types do not implicitly convert.

Worked example. Predict the output for each line.

expression result
1 + 2 3
"a" + "b" "ab"
[1] + [2, 3] [1, 2, 3]
(1,) + (2, 3) (1, 2, 3)
"a" + 1 TypeError

Worked-example solution.

print(1 + 2)              # 3
print("a" + "b")          # "ab"
print([1] + [2, 3])       # [1, 2, 3]
print((1,) + (2, 3))      # (1, 2, 3)

try:
    print("a" + 1)
except TypeError as e:
    print(f"TypeError: {e}")
# TypeError: can only concatenate str (not "int") to str

Rule of thumb: + always means "the same kind of thing on both sides"; mixing types means an explicit cast (str(1) or int("1")) before the operation.

* repeating: int repeat vs sequence-replicate vs Cartesian semantics

The * invariant: for numerics it multiplies; for sequence × int it replicates the sequence that many times; sequence × sequence raises TypeError. The replicate count must be an int; a zero or negative count yields an empty sequence.

  • int * int — 3 * 4 == 12, numeric.
  • "ab" * 3 — "ababab", replicate.
  • [1, 2] * 3 — [1, 2, 1, 2, 1, 2], replicate.
  • [0] * 5 — [0, 0, 0, 0, 0], the common idiom for a fixed-length zero list.
  • [[]] * 3 — [[], [], []], but all three inner lists are the same object; mutating one mutates all (gotcha).

Worked example. Predict each line.

expression result
3 * 4 12
"ab" * 3 "ababab"
[1, 2] * 3 [1, 2, 1, 2, 1, 2]
[0] * 5 [0, 0, 0, 0, 0]
[1, 2] * 0 []
[1, 2] * -3 [] (treated as zero)

Worked-example solution.

print(3 * 4)              # 12
print("ab" * 3)           # "ababab"
print([1, 2] * 3)         # [1, 2, 1, 2, 1, 2]
print([0] * 5)            # [0, 0, 0, 0, 0]
print([1, 2] * 0)         # []

# Aliasing gotcha
grid = [[]] * 3
grid[0].append("hi")
print(grid)               # [['hi'], ['hi'], ['hi']]  — all share the same inner list!

# Correct way to make 3 independent inner lists
grid_safe = [[] for _ in range(3)]
grid_safe[0].append("hi")
print(grid_safe)          # [['hi'], [], []]

Rule of thumb: [X] * N is fine when X is immutable (int, str, tuple); for mutable inner objects use [copy.copy(X) for _ in range(N)] or a comprehension.

== value equality vs is identity (and small-int caching)

The equality invariant: == calls __eq__ and tests value; is tests whether two names refer to the exact same object. They usually agree for small ints and short strings (because of caching) and usually disagree for newly-constructed mutables.

  • 5 == 5 — True (value).
  • 5 is 5 — True for small ints (cache); avoid relying on it.
  • [1, 2] == [1, 2] — True (value).
  • [1, 2] is [1, 2] — False (two distinct list objects).
  • x is None — the canonical pattern; None is a singleton.

Worked example. Predict each.

expression result reason
5 == 5 True value
5 is 5 True small-int cache
1000 is 1000 usually True (literal in same compile unit) compiler folding
a = 1000; b = 1000; a is b usually False (in CPython REPL) distinct allocations outside cache
[1, 2] == [1, 2] True value
[1, 2] is [1, 2] False distinct lists

Worked-example solution.

print(5 == 5)             # True
print(5 is 5)             # True (small-int cache)

a = 1000
b = 1000
print(a == b)             # True
print(a is b)             # implementation-dependent: often True in one compiled script, False when typed as separate REPL lines

print([1, 2] == [1, 2])   # True
print([1, 2] is [1, 2])   # False — two distinct list objects

x = None
print(x is None)          # True — the idiomatic check

Rule of thumb: use == for "same value"; use is only for is None, is True, is False, and identity comparisons against sentinel singletons.

Common beginner mistakes

  • Using is for value equality—works for small ints, breaks elsewhere; always use ==.
  • Assuming [1, 2] + 3 works—it doesn't; TypeError.
  • Building [[]] * N and mutating inner lists—aliasing bug.
  • Expecting "abc" * 1.5 to work—TypeError; replicate counts must be int.
  • Confusing string-concat speed: repeated s += part in a loop is O(N²) because each += can copy the whole string built so far; use "".join(parts) for O(N).
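The join-vs-concat point in the last bullet, in miniature (the parts list is illustrative):

```python
parts = ["pay", "pal", "data", "eng"]

# O(N^2) worst case: each += can copy the whole string built so far
slow = ""
for p in parts:
    slow += p

# O(N): join computes the total length once, then copies each part once
fast = "".join(parts)

assert slow == fast == "paypaldataeng"
```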

Python Interview Question on Type Operations Output

Predict the output for each statement below in order. Then return them as a list of strings ("True", "False", "TypeError", or the printed value).

1 + 2
"a" + "b" * 3
[1] + [2, 3]
[1, 2] * 3
[1, 2] == [1, 2]
[1, 2] is [1, 2]
None is None
5 == 5.0
5 is 5.0
"" * 100

Solution Using a manual evaluation that respects operator precedence

results = [
    "3",                  # 1 + 2
    "abbb",               # "a" + ("b" * 3) — * binds tighter than +
    "[1, 2, 3]",          # [1] + [2, 3]
    "[1, 2, 1, 2, 1, 2]", # [1, 2] * 3
    "True",               # value equality on lists
    "False",              # distinct list objects
    "True",               # None is a singleton
    "True",               # 5 == 5.0 — Python compares numeric values across types
    "False",              # 5 is 5.0 — different types, different objects
    "''",                 # empty string * 100 = empty string
]

Step-by-step trace:

  1. 1 + 2 — int + int → 3.
  2. "a" + "b" * 3* has higher precedence than +, so this is "a" + ("b" * 3) = "a" + "bbb" = "abbb".
  3. [1] + [2, 3] — list concat → [1, 2, 3].
  4. [1, 2] * 3 — list replicate → [1, 2, 1, 2, 1, 2].
  5. [1, 2] == [1, 2] — value equality on lists; both have the same contents → True.
  6. [1, 2] is [1, 2] — identity test; the two literal lists are distinct objects → False.
  7. None is None — None is a singleton; both names point at the same object → True.
  8. 5 == 5.0 — Python compares numeric values across int and float → True.
  9. 5 is 5.0 — 5 is an int object; 5.0 is a float object; different types → False.
  10. "" * 100 — replicating an empty string any number of times yields the empty string → "".

Output:

line result
1 3
2 "abbb"
3 [1, 2, 3]
4 [1, 2, 1, 2, 1, 2]
5 True
6 False
7 True
8 True
9 False
10 ""

Why this works — concept by concept:

  • Operator precedence — * binds tighter than +, so "a" + "b" * 3 reads as "a" + ("b" * 3) and the result is "abbb", not "ababab".
  • Polymorphic + — int + int adds, sequence + sequence concatenates; the dispatch is based on the operand types.
  • Polymorphic * — int * int multiplies; sequence * int replicates.
  • List value equality — == on lists compares element-by-element, so [1, 2] == [1, 2] is True even when the two lists are distinct objects.
  • List identity — is tests whether two names point at the same object; two list literals on different lines are different objects, so is is False.
  • None singleton — there is only one None in a Python program; is None is the canonical null check.
  • Cross-type numeric == — Python upcasts 5 to 5.0 for the comparison, returning True; is does no such conversion and returns False because the types differ.
  • Empty-string replicate — "" * N is empty for any N; multiplying nothing by any count is still nothing.
  • Cost — every operation is O(1) in the size of these literals; the trace itself is constant-time.

PYTHON
Topic — type handling
Type-handling problems

Practice →

PYTHON
Topic — list comprehension
List-comprehension problems

Practice →


3. Hash Tables and defaultdict Aggregation for Sales Reports

defaultdict aggregation for per-key totals in Python for data engineering

"Build a sales report grouped by event type" is the canonical hash-table aggregation prompt for PayPal's catering / merchant flows. The mental model: defaultdict(int) lets you do counts[key] += amount without checking for key existence; the dict accumulates per-key totals in one pass; a final sorted(...) step ranks the output. This is the everyday DE pattern for any per-entity rollup—per-merchant fees, per-category revenue, per-cohort signups.

Diagram showing a list of catering sales rows on the left feeding a defaultdict(int) accumulator that maintains per-event totals, with a sorted output list on the right.

Pro tip: If the value field is an int total, defaultdict(int) is right; if you need a sum and a count, defaultdict(lambda: [0, 0]) (or a small dataclass) is cleaner than two parallel dicts. Use Counter only when the input is a flat iterable of keys—not when each row carries a separate amount.
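A minimal sketch of the sum-and-count variant from the tip (the rows are illustrative); one [total, count] pair per key replaces two parallel dicts:

```python
from collections import defaultdict

rows = [('wedding', 200), ('birthday', 80), ('wedding', 150)]

stats = defaultdict(lambda: [0, 0])   # key -> [running_sum, row_count]
for event, amount in rows:
    stats[event][0] += amount
    stats[event][1] += 1

averages = {event: total / count for event, (total, count) in stats.items()}
print(averages)                       # {'wedding': 175.0, 'birthday': 80.0}
```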

defaultdict(int) for per-key totals

The defaultdict invariant: defaultdict(factory) returns the factory's value when a key is missing. With factory int, missing keys default to 0, so += always works without a KeyError. The structure: event_type → running total.

  • from collections import defaultdict — the import.
  • totals = defaultdict(int) — declare with factory int (returns 0 on missing key).
  • totals[k] += amount — accumulate; missing key auto-initializes to 0.
  • dict(totals) — convert to a plain dict for return / serialization (optional but idiomatic).

Worked example. Three catering rows: ('wedding', 200), ('birthday', 80), ('wedding', 150).

step row totals before action totals after
1 ('wedding', 200) {} totals['wedding'] += 200 {'wedding': 200}
2 ('birthday', 80) {'wedding': 200} totals['birthday'] += 80 {'wedding': 200, 'birthday': 80}
3 ('wedding', 150) {'wedding': 200, 'birthday': 80} totals['wedding'] += 150 {'wedding': 350, 'birthday': 80}

Worked-example solution.

from collections import defaultdict

rows = [('wedding', 200), ('birthday', 80), ('wedding', 150)]
totals = defaultdict(int)
for event, amount in rows:
    totals[event] += amount

print(dict(totals))
# {'wedding': 350, 'birthday': 80}

Rule of thumb: if your loop body is if k not in d: d[k] = 0; d[k] += v, that is exactly what defaultdict(int) automates—replace it.

dict.items() + sorted(..., key=lambda kv: -kv[1]) for ranked output

The ranking invariant: sorted returns a new list of (key, value) tuples ordered by the key function. For descending revenue, sort by negative amount; alternatively, pass reverse=True.

  • sorted(d.items(), key=lambda kv: -kv[1]) — descending by value (negate).
  • sorted(d.items(), key=lambda kv: kv[1], reverse=True) — equivalent; arguably clearer.
  • Tie-break by key — sorted(d.items(), key=lambda kv: (-kv[1], kv[0])) (descending value, ascending key on ties).
  • Top-N — slice with [:N] after sorting, or use heapq.nlargest(N, d.items(), key=lambda kv: kv[1]) for O(M log N).
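The heapq route from the Top-N bullet, sketched on an illustrative totals dict:

```python
import heapq

totals = {'wedding': 350, 'birthday': 80, 'corporate': 300, 'gala': 120}

# Top 2 by value without sorting all K items: O(K log N) instead of O(K log K)
top2 = heapq.nlargest(2, totals.items(), key=lambda kv: kv[1])
print(top2)   # [('wedding', 350), ('corporate', 300)]
```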

Worked example. Continuing the previous totals.

sort key output
key=lambda kv: -kv[1] [('wedding', 350), ('birthday', 80)]
key=lambda kv: kv[1], reverse=True [('wedding', 350), ('birthday', 80)]
key=lambda kv: (-kv[1], kv[0]) [('wedding', 350), ('birthday', 80)]

Worked-example solution.

ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
# [('wedding', 350), ('birthday', 80)]

# Top 1 (most-revenue event)
print(ranked[0])
# ('wedding', 350)

Rule of thumb: always add a tie-break key; otherwise the sort is non-deterministic when two values are equal, and tests flake.

Multi-field aggregation: nested dicts vs (key1, key2) tuple keys

The composition invariant: for two-dimensional aggregates (event × city), use a tuple key (event, city) or a nested dict {event: {city: total}}. Tuple keys read more cleanly when both fields are always present; nested dicts let you ask "all cities for this event" in one lookup.

  • Tuple key — totals[(event, city)] += amount.
  • Nested dict — totals[event][city] += amount; needs defaultdict(lambda: defaultdict(int)).
  • Pandas alternative — df.groupby(['event', 'city'])['amount'].sum(); same idea, vectorized.
  • Choose tuple keys for serialization to JSON; choose nested for "per-event sub-rollup" queries.

Worked example. Aggregate by (event, city).

rows = [
    ('wedding', 'NY', 200),
    ('birthday', 'SF', 80),
    ('wedding', 'NY', 150),
    ('wedding', 'SF', 100),
]

Expected:

(event, city) total
(wedding, NY) 350
(birthday, SF) 80
(wedding, SF) 100

Worked-example solution.

from collections import defaultdict

totals = defaultdict(int)
for event, city, amount in rows:
    totals[(event, city)] += amount

for k, v in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0])):
    print(k, v)
# ('wedding', 'NY') 350
# ('wedding', 'SF') 100
# ('birthday', 'SF') 80

Rule of thumb: tuple keys are fine until you need "all cities for one event"; at that point, switch to nested dicts.
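The nested-dict side of that trade-off, sketched on the same rows as the worked example above:

```python
from collections import defaultdict

rows = [
    ('wedding', 'NY', 200),
    ('birthday', 'SF', 80),
    ('wedding', 'NY', 150),
    ('wedding', 'SF', 100),
]

# event -> {city -> total}; the inner defaultdict(int) auto-creates per city
nested = defaultdict(lambda: defaultdict(int))
for event, city, amount in rows:
    nested[event][city] += amount

# "All cities for this event" is now a single lookup
print(dict(nested['wedding']))   # {'NY': 350, 'SF': 100}
```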

Common beginner mistakes

  • Using dict[k] += v without defaultdict — KeyError on the first row for each key.
  • Using Counter when each row carries its own amount — Counter(iterable) counts occurrences; it does not sum a separate amount field.
  • Sorting by -kv[1] with string values — TypeError on negation; use reverse=True instead.
  • Forgetting the dict() cast when serializing — json.dumps accepts a defaultdict (it is a dict subclass), but the default factory is lost on round-trip; casting makes the intent explicit.
  • Mixing tuple-keyed and nested-dict aggregations in the same pipeline — pick one and stick with it.

Python Interview Question on a Catering Sales Report

You are given a list of catering sales rows of the form (event_type: str, amount: int). Return a list of (event_type, total) tuples sorted by total descending, then by event_type ascending on ties.

def catering_sales_report(rows):
    ...

Solution Using defaultdict(int) and a tuple-keyed sorted(...)

from collections import defaultdict

def catering_sales_report(rows):
    totals = defaultdict(int)
    for event, amount in rows:
        totals[event] += amount
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

Step-by-step trace (input: 6 rows below):

event_type amount
wedding 200
birthday 80
wedding 150
corporate 300
birthday 40
wedding 50
  1. Initialize — totals = defaultdict(int) (empty, default factory int).
  2. Row 1 — totals['wedding'] += 200 → {'wedding': 200}.
  3. Row 2 — totals['birthday'] += 80 → {'wedding': 200, 'birthday': 80}.
  4. Row 3 — totals['wedding'] += 150 → {'wedding': 350, 'birthday': 80}.
  5. Row 4 — totals['corporate'] += 300 → {'wedding': 350, 'birthday': 80, 'corporate': 300}.
  6. Row 5 — totals['birthday'] += 40 → {'wedding': 350, 'birthday': 120, 'corporate': 300}.
  7. Row 6 — totals['wedding'] += 50 → {'wedding': 400, 'birthday': 120, 'corporate': 300}.
  8. sorted(totals.items(), key=lambda kv: (-kv[1], kv[0])) — descending by total, ascending by event on ties: [('wedding', 400), ('corporate', 300), ('birthday', 120)].

Output:

event_type total
wedding 400
corporate 300
birthday 120

Why this works — concept by concept:

  • defaultdict(int) — supplies a 0 baseline for any missing key, so totals[event] += amount works on the first occurrence without a manual existence check.
  • += accumulation — adds the current row's amount onto the running total; order-independent across repeated keys.
  • dict.items() — produces the (key, value) pairs we feed into sorted.
  • Tuple sort key — (-kv[1], kv[0]) sorts primarily by descending total (negate the amount) and breaks ties by ascending event name; deterministic.
  • One-pass + one-sort — aggregation is one linear pass over the rows; the sort is O(K log K) where K is the number of distinct event types, typically tiny.
  • O(N + K log K) total cost — N rows aggregated, K distinct keys sorted; in practice the sort is negligible.

PYTHON
Topic — hash table
Hash-table problems

Practice →

PYTHON
PayPal — hash table
PayPal hash-table problems

Practice →


4. Set Intersection for Recommendation Engines

Set algebra for recommendation queries in Python for data engineering

"Find restaurants both users like" is the canonical set-intersection interview prompt. The mental model: build a set of liked items per user; intersect (&) the sets to surface shared favorites; union (|) for combined preferences; difference (-) for "user A's picks that user B has not seen". Set algebra is O(min(len(a), len(b))) for intersection, which is why every recommendation pipeline reaches for sets before it reaches for nested loops.

Pro tip: When you have N user sets and need the items everyone likes, use reduce(set.intersection, sets) or set.intersection(*sets)—both fold across all the sets for you. Don't manually loop with result &= s; the reduce form is one line and idiomatic.

Building per-user sets from interaction lists

The construction invariant: a per-user set is a deduplicated collection of items that user has interacted with. From a list of (user, item) rows, group by user and collect items into a set.

  • set(items) — wrap a list to get a deduplicated set.
  • defaultdict(set) — accumulate per-user sets in one pass.
  • liked[user].add(item) — append-or-create.
  • Set membership — item in liked[user] is O(1).

Worked example. Build liked from interaction rows.

rows = [
    ('alice', 'pasta_palace'),
    ('alice', 'sushi_world'),
    ('bob',   'pasta_palace'),
    ('bob',   'tacos_y_tu'),
    ('alice', 'pasta_palace'),  # duplicate; set dedups
]

Expected liked:

user set
alice {'pasta_palace', 'sushi_world'}
bob {'pasta_palace', 'tacos_y_tu'}

Worked-example solution.

from collections import defaultdict

liked = defaultdict(set)
for user, item in rows:
    liked[user].add(item)

print({k: v for k, v in liked.items()})
# {'alice': {'pasta_palace', 'sushi_world'}, 'bob': {'pasta_palace', 'tacos_y_tu'}}

Rule of thumb: defaultdict(set) is to per-key collections what defaultdict(int) is to per-key totals.

set_a & set_b intersection for shared favorites

The intersection invariant: a & b returns a new set containing only the items in both a and b. Time is O(min(len(a), len(b)))—the implementation iterates the smaller side and probes the larger.

  • a & b — operator form.
  • a.intersection(b) — method form; accepts any iterable.
  • a & b & c — chained intersection across more than two sets.
  • a.intersection(b, c, d) — varargs form for many sets.

Worked example. Intersect Alice's and Bob's likes.

set items
liked['alice'] {'pasta_palace', 'sushi_world'}
liked['bob'] {'pasta_palace', 'tacos_y_tu'}
liked['alice'] & liked['bob'] {'pasta_palace'}

Worked-example solution.

shared = liked['alice'] & liked['bob']
print(shared)            # {'pasta_palace'}
print(len(shared))       # 1

Rule of thumb: if the question says "both" or "all of," reach for &; if it says "either" or "any of," reach for |.

Multi-set rollup: reduce(set.intersection, sets) and "new only" via difference

The composition invariant: N-way intersections fold cleanly with reduce; "new to user X" is set difference. The difference operation a - b returns items in a but not in b.

  • reduce(set.intersection, [a, b, c]) — items in all three sets.
  • set.intersection(*[a, b, c]) — equivalent; one call.
  • a - b — items in a but not in b.
  • a ^ b — symmetric difference (in either, but not both).

Worked example. Three users; find what all three like, plus "new to alice" given the group.

set items
alice {'pasta_palace', 'sushi_world'}
bob {'pasta_palace', 'tacos_y_tu'}
carol {'pasta_palace', 'sushi_world', 'kebab_king'}
query result
set.intersection(*sets) {'pasta_palace'}
(alice | bob | carol) - alice {'tacos_y_tu', 'kebab_king'}

Worked-example solution.

from functools import reduce

sets = [liked['alice'], liked['bob'], liked['carol']]
common_to_all = reduce(set.intersection, sets)
print(common_to_all)     # {'pasta_palace'}

group_pool = liked['alice'] | liked['bob'] | liked['carol']
new_for_alice = group_pool - liked['alice']
print(new_for_alice)     # {'tacos_y_tu', 'kebab_king'}

Rule of thumb: most "recommend something" questions reduce to: take the union of the group, subtract the target user's already-known set, optionally rank by frequency.

Common beginner mistakes

  • Using list(set(...)) and assuming order is preserved—sets are unordered.
  • Using a for loop with if x in seen instead of set.intersection—correct but slow.
  • Building a list of (user, item) tuples and manually deduping—use set from the start.
  • Forgetting that the empty set is set(), not {}{} is an empty dict.
  • Using & on a list and a set—TypeError; convert the list to a set first.
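The last two gotchas are worth confirming once so they never bite in a live round — a minimal check:

```python
# {} is a dict literal; the empty set has no literal.
empty = {}
assert isinstance(empty, dict)
assert isinstance(set(), set)

# & between a list and a set raises TypeError; convert the list first.
fav_list = ['a', 'b', 'a']
fav_set = {'a', 'c'}
try:
    fav_list & fav_set
except TypeError:
    pass  # expected: list does not support &
assert set(fav_list) & fav_set == {'a'}
```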

Python Interview Question on Restaurant Recommendations

You are given a list of (user, restaurant) likes and two user IDs u1 and u2. Return a sorted list of restaurants that both users have liked. If either user has no likes, return an empty list.

def shared_restaurants(rows, u1, u2):
    ...

Solution Using a defaultdict(set) index and a single intersection

from collections import defaultdict

def shared_restaurants(rows, u1, u2):
    liked = defaultdict(set)
    for user, restaurant in rows:
        liked[user].add(restaurant)
    return sorted(liked[u1] & liked[u2])

Step-by-step trace (input: 7 likes across 3 users, query u1='alice', u2='bob'):

| user | restaurant |
|---|---|
| alice | pasta_palace |
| alice | sushi_world |
| alice | kebab_king |
| bob | pasta_palace |
| bob | kebab_king |
| bob | tacos_y_tu |
| carol | pasta_palace |
  1. Initializeliked = defaultdict(set).
  2. Iterate rows — accumulate per-user sets:
    • liked['alice'] = {'pasta_palace', 'sushi_world', 'kebab_king'}
    • liked['bob'] = {'pasta_palace', 'kebab_king', 'tacos_y_tu'}
    • liked['carol'] = {'pasta_palace'}
  3. Intersectliked['alice'] & liked['bob'] = {'pasta_palace', 'kebab_king'}.
  4. Sortsorted(...) returns ['kebab_king', 'pasta_palace'] (alphabetic ascending).

Output:

return value
['kebab_king', 'pasta_palace']

Why this works — concept by concept:

  • defaultdict(set) — per-user sets accumulate without KeyError; add is the dedup-aware append.
  • set.addO(1) amortized; duplicates within the input are ignored automatically.
  • liked[u1] & liked[u2] — Python's set operator; iterates the smaller side and probes the larger; O(min(|s1|, |s2|)).
  • sorted(...) — converts the unordered set to a deterministic list; the prompt asks for sorted output.
  • Single-pass aggregation — one walk over the input builds the per-user index; the intersection then runs against finished sets.
  • CostO(R + min(|s1|, |s2|) + K log K) time, where R is total interaction rows and K = |s1 ∩ s2| for the final sort; O(R) space for the index.
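The defaultdict also handles the "user has no likes" edge case for free — indexing an unseen user yields an empty set, so the intersection is empty. A standalone check (the function is restated here so the snippet runs on its own):

```python
from collections import defaultdict

def shared_restaurants(rows, u1, u2):
    liked = defaultdict(set)
    for user, restaurant in rows:
        liked[user].add(restaurant)
    # Unseen users produce empty sets, so the intersection is empty -> [].
    return sorted(liked[u1] & liked[u2])

rows = [
    ('alice', 'pasta_palace'), ('alice', 'sushi_world'), ('alice', 'kebab_king'),
    ('bob', 'pasta_palace'), ('bob', 'kebab_king'), ('bob', 'tacos_y_tu'),
    ('carol', 'pasta_palace'),
]
print(shared_restaurants(rows, 'alice', 'bob'))   # ['kebab_king', 'pasta_palace']
print(shared_restaurants(rows, 'alice', 'dave'))  # [] -- dave has no likes
```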



5. Graphs and Sets for Bipartite Seating Validation

BFS 2-coloring on a conflict graph in Python for data engineering

"Given a list of conflict pairs, decide if the guests can be split into two tables with no conflicts at the same table" is the canonical bipartite-graph interview prompt. The mental model: build an adjacency dict from the pair list, BFS from each unvisited node, alternately color neighbors; if any edge ends up with both endpoints sharing a color, the graph is not bipartite. The same pattern decides every "two-team / two-shift / two-region" question whose constraint is "these pairs must be on opposite sides."

Diagram of a 5-node graph with conflict edges colored alternately as table A and table B via BFS, with one violation edge highlighted in red showing two same-color endpoints.

Pro tip: Always loop over every node in the graph, not just nodes[0]. The conflict graph may be disconnected (multiple isolated components), and a single BFS from one node only colors that component. The outer for node in graph: if node not in colors: ... loop is the canonical pattern.
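To see why the outer loop matters, here is a sketch (bfs_color is an illustrative helper, not from the problem set) where a single BFS misses a triangle hiding in a second component:

```python
from collections import defaultdict, deque

def bfs_color(adj, start, colors):
    # Color one component; False if any edge gets the same color on both ends.
    colors[start] = 0
    q = deque([start])
    while q:
        node = q.popleft()
        for nb in adj[node]:
            if nb not in colors:
                colors[nb] = 1 - colors[node]
                q.append(nb)
            elif colors[nb] == colors[node]:
                return False
    return True

# Two components: {1, 2} is fine, but {3, 4, 5} is a triangle.
pairs = [(1, 2), (3, 4), (4, 5), (5, 3)]
adj = defaultdict(set)
for u, v in pairs:
    adj[u].add(v)
    adj[v].add(u)

colors = {}
print(bfs_color(adj, 1, colors))  # True -- one BFS never sees the triangle

colors = {}
ok = all(bfs_color(adj, n, colors) for n in list(adj) if n not in colors)
print(ok)  # False -- the outer loop reaches the triangle component
```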

Adjacency dict from a list of pair conflicts

The adjacency invariant: for an undirected conflict graph, every pair (u, v) adds v to adj[u] and u to adj[v]. The dict-of-sets form is the canonical Python structure—O(1) neighbor lookup, automatic dedup of duplicate edges.

  • adj = defaultdict(set) — per-node neighbor set.
  • adj[u].add(v); adj[v].add(u) — symmetric add for undirected.
  • adj[u].add(v) only — directed graph (different problem).
  • Use a set, not a list — duplicates and lookups are both faster.

Worked example. Pairs [(1, 2), (2, 3), (3, 1)] → triangle.

node neighbors
1 {2, 3}
2 {1, 3}
3 {1, 2}

Worked-example solution.

from collections import defaultdict

def build_adj(pairs):
    adj = defaultdict(set)
    for u, v in pairs:
        adj[u].add(v)
        adj[v].add(u)
    return adj

build_adj([(1, 2), (2, 3), (3, 1)])
# defaultdict(set, {1: {2, 3}, 2: {1, 3}, 3: {1, 2}})

Rule of thumb: always build the adjacency structure first; then BFS / DFS becomes a clean separate phase.

BFS / DFS 2-coloring: assign color, push neighbors with the opposite color

The coloring invariant: start from an arbitrary unvisited node, color it 0, then BFS pushing each neighbor with the opposite color. The colors dict maps node → 0 or 1; 0 = "Table A," 1 = "Table B."

  • from collections import deque — BFS queue.
  • q = deque([start]); colors[start] = 0 — seed.
  • Pop, then iterate neighbors — for each neighbor nb: if uncolored, set colors[nb] = 1 - colors[node], push to queue.
  • Outer loop over all nodesfor node in graph: if node not in colors: bfs(node) — handles disconnected components.

Worked example. A 4-node path graph 1 - 2 - 3 - 4. BFS from 1.

| step | queue after | popped | colors after |
|---|---|---|---|
| 0 | [1] | — | {1: 0} |
| 1 | [2] | 1 | {1: 0, 2: 1} |
| 2 | [3] | 2 | {1: 0, 2: 1, 3: 0} |
| 3 | [4] | 3 | {1: 0, 2: 1, 3: 0, 4: 1} |

All edges connect 0 to 1—bipartite ✓.

Worked-example solution.

from collections import deque, defaultdict

def two_color(adj, start, colors):
    q = deque([start])
    colors[start] = 0
    while q:
        node = q.popleft()
        for nb in adj[node]:
            if nb not in colors:
                colors[nb] = 1 - colors[node]
                q.append(nb)
            elif colors[nb] == colors[node]:
                return False  # conflict
    return True

Rule of thumb: 1 - color is the clean "flip" idiom; don't reach for not color (which works on truthy values but is less explicit).

Conflict detection: same color on both ends of an edge → not bipartite

The validation invariant: during BFS, when you visit a neighbor that is already colored, check that its color is opposite; if it's the same, the graph is not bipartite. This is the only place the algorithm can return False.

  • Check colors[nb] == colors[node] — same color on both endpoints of the edge.
  • Return False immediately — short-circuit; no point in continuing.
  • All-neighbors-checked + outer-loop-completed = True — every component validated.
  • Self-loop edge (u, u) → automatically not bipartite (same color trivially).

Worked example. Triangle 1 - 2 - 3 - 1. BFS from 1.

| step | queue after | popped | colors after | conflict? |
|---|---|---|---|---|
| 0 | [1] | — | {1: 0} | no |
| 1 | [2, 3] | 1 | {1: 0, 2: 1, 3: 1} | no |
| 2 | [3] | 2 | neighbor 3 already colored 1 = colors[2] | yes |

Triangle is not bipartite.

Worked-example solution.

# Reuses two_color from the BFS snippet above; import restated for standalone runs.
from collections import defaultdict

def is_bipartite(pairs):
    adj = defaultdict(set)
    for u, v in pairs:
        adj[u].add(v)
        adj[v].add(u)

    colors = {}
    for node in list(adj):
        if node in colors:
            continue
        if not two_color(adj, node, colors):
            return False
    return True

print(is_bipartite([(1, 2), (2, 3), (3, 1)]))  # False (triangle)
print(is_bipartite([(1, 2), (2, 3), (3, 4)]))  # True (path)

Rule of thumb: odd-length cycles always break bipartite-ness; even-length cycles are fine. Triangles, pentagons, heptagons—doomed. Squares, hexagons—safe.

Common beginner mistakes

  • BFS from only one node and assuming the graph is connected—miss other components.
  • Using a list for the queue and pop(0)O(N) per pop instead of O(1); use collections.deque.
  • Coloring with True / False—works, but flipping with not c turns the colors into bools; 0 / 1 with 1 - c is explicit.
  • Forgetting the symmetric add adj[v].add(u)—half the edges go missing.
  • Treating "no conflict at all" as "bipartite"—technically a graph with no edges is bipartite trivially, but most prompts want non-empty inputs.
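The True / False point is easy to demonstrate — a two-line check of what each flip idiom actually produces:

```python
color = 0
assert 1 - color == 1 and isinstance(1 - color, int)  # explicit int flip

flipped = not color               # True -- a bool, not the int 1
assert flipped == 1               # passes, because bool is a subclass of int...
assert isinstance(flipped, bool)  # ...but the color is now a bool
```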

Python Interview Question on Valid Seating Arrangement

You are given a list of guests and a list of "do not seat together" pairs. Decide whether the guests can be split into two tables such that no pair sits at the same table. Return True if a valid assignment exists, False otherwise.

def valid_seating(guests, conflicts):
    ...

Solution Using BFS 2-coloring on the conflict graph

from collections import defaultdict, deque

def valid_seating(guests, conflicts):
    adj = defaultdict(set)
    for u, v in conflicts:
        adj[u].add(v)
        adj[v].add(u)

    colors = {}
    for start in guests:
        if start in colors:
            continue
        colors[start] = 0
        q = deque([start])
        while q:
            node = q.popleft()
            for nb in adj[node]:
                if nb not in colors:
                    colors[nb] = 1 - colors[node]
                    q.append(nb)
                elif colors[nb] == colors[node]:
                    return False
    return True

Step-by-step trace (input: guests = [1, 2, 3, 4, 5], conflicts = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)] — a 5-cycle):

  1. Build adjacency:

| node | neighbors |
|-----:|-----------|
| 1 | {2, 5} |
| 2 | {1, 3} |
| 3 | {2, 4} |
| 4 | {3, 5} |
| 5 | {1, 4} |

  2. Outer loop — start at guest 1. colors = {1: 0}. Queue [1].
  3. Pop 1 — neighbors {2, 5}. Color 2 := 1, push. Color 5 := 1, push. colors = {1: 0, 2: 1, 5: 1}. Queue [2, 5].
  4. Pop 2 — neighbors {1, 3}. 1 already colored 0 ≠ 1 — OK. Color 3 := 0, push. colors = {1: 0, 2: 1, 5: 1, 3: 0}. Queue [5, 3].
  5. Pop 5 — neighbors {1, 4}. 1 already colored 0 ≠ 1 — OK. Color 4 := 0, push. colors = {1: 0, 2: 1, 5: 1, 3: 0, 4: 0}. Queue [3, 4].
  6. Pop 3 — neighbors {2, 4}. 2 already colored 1 ≠ 0 — OK. Check 4: already colored 0 = colors[3] — conflict. Return False.

The 5-cycle is not bipartite (odd-length cycle), so seating fails.

Output:

return value
False

Why this works — concept by concept:

  • Adjacency dictdefaultdict(set) makes neighbor lookups O(1) and dedup-safe; the symmetric adj[u].add(v); adj[v].add(u) reflects the undirected-conflict semantics.
  • BFS queue (deque)O(1) popleft and append, the right structure for level-order traversal of the conflict graph.
  • 2-coloring with 1 - color — flips between 0 and 1 cleanly; both colors are integers so they are usable as dict values without surprises.
  • Outer loop over all guests — handles disconnected components by re-seeding BFS from any uncolored node.
  • Same-color edge check — the only failure mode; the moment a neighbor is already colored with the same value as the current node, the graph contains an odd-length cycle and bipartiteness is impossible.
  • Early return — once we detect a conflict, return immediately; no point exploring the rest.
  • CostO(V + E) time (each node and edge visited once), O(V) space for colors and the queue, where V = guests and E = conflict pairs.
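A quick standalone exercise of the solution on the odd cycle, an even cycle, and the no-conflicts edge case (the function is restated so the snippet runs on its own):

```python
from collections import defaultdict, deque

def valid_seating(guests, conflicts):
    adj = defaultdict(set)
    for u, v in conflicts:
        adj[u].add(v)
        adj[v].add(u)
    colors = {}
    for start in guests:          # re-seed BFS for each component
        if start in colors:
            continue
        colors[start] = 0
        q = deque([start])
        while q:
            node = q.popleft()
            for nb in adj[node]:
                if nb not in colors:
                    colors[nb] = 1 - colors[node]
                    q.append(nb)
                elif colors[nb] == colors[node]:
                    return False  # same-color edge -> odd cycle
    return True

print(valid_seating([1, 2, 3, 4, 5], [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]))  # False: 5-cycle
print(valid_seating([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 1)]))             # True: 4-cycle
print(valid_seating([1, 2, 3], []))                                              # True: no conflicts, trivially bipartite
```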


Tips to Crack PayPal Data Engineering Interviews

Drill the five PayPal Python primitives, in order

The five Python primitives in the curated PayPal Python practice set are list comprehensions (#29), operator-semantic fluency (#30), defaultdict(int) aggregation (#98), set algebra (#289), and BFS 2-coloring on adjacency dicts (#290). Each maps to a specific module / idiom: comprehension syntax for #29, type-quirk awareness for #30, from collections import defaultdict for #98, set operators (&, |, -) for #289, from collections import deque, defaultdict for #290. Memorize the imports as muscle memory and the prompts become pattern recognition.

Drill operator semantics until they are reflexive

#30 Python Type Operations Output is the make-or-break easy round at PayPal. Candidates who guess on 5 is 5.0 (False) or [1, 2] * 3 ([1, 2, 1, 2, 1, 2]) lose the round. Drill the type-handling practice page until you can predict every operator output without thinking. Cover small-int caching (is quirks), sequence concat with +, sequence replicate with *, the + * == is matrix across int / str / list / tuple, and the aliasing gotcha on [[]] * N.
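A self-checking drill sheet for those quirks (every assert passes; identity results on small-int literals are CPython implementation details, so this sketch sticks to cases with guaranteed answers):

```python
a, b = 5, 5.0
assert (a == b) is True   # equal by value across int / float
assert (a is b) is False  # different objects: an int is never a float

assert [1, 2] * 3 == [1, 2, 1, 2, 1, 2]  # sequence replicate
assert 'pay' + 'pal' == 'paypal'         # sequence concat
assert (1, 2) + (3,) == (1, 2, 3)        # tuples concat too

rows = [[]] * 3           # three references to ONE inner list
rows[0].append('x')
assert rows == [['x'], ['x'], ['x']]     # the aliasing gotcha
```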

Treat the bipartite-seating problem as your medium-tier ceiling

#290 Valid Seating Arrangement is the hardest in the set. If you cannot fluently write the defaultdict(set) adjacency builder, the BFS 2-coloring loop, and the same-color conflict check, you cannot pass the PayPal medium round. Drill the graph practice page until the BFS template is muscle memory. The general "two-team / two-shift / two-region" pattern is the same algorithm.

SQL is not on this loop—skip it

The curated PayPal set is 5 Python, 0 SQL. Spending preparation hours on SQL window functions for a PayPal data-engineering interview is a misallocation. Spend it instead on Python comprehensions, operator semantics, defaultdict, set algebra, and graph BFS. If a SQL question appears, treat it as a different role family rather than the canonical loop covered in this guide.

Easy-tier discipline matters

Three of the five PayPal problems are EASY-tier. Easy at PayPal does not mean trivial; it means the interviewer expects zero hesitation, no off-by-ones, idiomatic Python, and an articulated invariant. A correct EASY answer with stuttering or one bug is graded worse than a correct MEDIUM answer with the same bug, because the EASY signal is supposed to be "this candidate writes Python fluently." The PayPal easy practice set is the right warmup.

Where to practice on PipeCode

Start with the PayPal practice page for the curated 5-problem set. After that, drill the matching topic pages: list-comprehension, type-handling, hash-table, set, graph. The interview courses page bundles the Python course if you want a structured curriculum before the company drills. For broad coverage, browse by topic.

Communication and approach under time pressure

Talk through the invariant first, the brute force second, and the optimal third—even if you go straight to the optimal in code. Interviewers grade process as much as the final answer. Leave 5 minutes at the end of each problem for an edge-case sweep: empty list, single element, all duplicates, isolated graph nodes, self-loops, mixed types. The most common "almost passed" failure mode is correct happy-path code that crashes on the empty input—a 30-second sweep prevents it.


Frequently Asked Questions

What is the PayPal data engineering interview process like?

PayPal's data engineering interview opens with a recruiter screen, then a technical phone screen with one Python coding problem, then an onsite (or virtual onsite) of four to five rounds: two Python coding rounds (comprehensions, type-operations, hash-table aggregation, set algebra, graph BFS), one data-modeling or system-design discussion (payment pipelines, merchant data, fraud signals), and one to two behavioral rounds. The coding rounds are heavy on the patterns covered in this guide.

What Python topics does PayPal test for data engineers?

PayPal Python interviews concentrate on five primitives that map directly to the curated 5-problem set: list comprehensions for filter-then-map array transforms (#29), operator semantics across int / str / list / tuple (#30), collections.defaultdict aggregation for per-key totals (#98), set algebra (&, |, -) for recommendations (#289), and BFS 2-coloring on adjacency dicts for bipartite validation (#290). Practice the imports as muscle memory: from collections import defaultdict, deque.

Does PayPal test SQL in data engineering interviews?

The curated PayPal practice set is 5 Python, 0 SQL—PayPal's data engineering interview is Python-only in the curated loop. If a SQL question appears, treat it as a different role family (analytics engineer, data scientist) rather than the canonical data-engineering loop covered here. Spending prep time on SQL window functions for a PayPal data-engineering interview is a misallocation; Python fluency is the entire signal.

How hard are PayPal data engineering interview questions?

The PayPal set is 3 easy + 2 medium + 0 hard. The EASY problems test list-comprehension fluency (#29), operator semantics (#30), and defaultdict aggregation (#98)—straightforward pattern recognition. The two MEDIUM problems are set-intersection recommendations (#289) and bipartite-graph BFS validation (#290); both are pattern-recognition once you have seen them. PayPal does not include a HARD-tier problem in this set, but seniority is graded on cleanliness and articulation, not just correctness.

Are list comprehensions and type-operations common in PayPal interviews?

Yes—both #29 List Comprehension Conversion and #30 Python Type Operations Output are explicit easy-tier prompts. Candidates who cannot fluently rewrite a for / append / if loop as [expr for x in iter if cond] lose the comprehension round; candidates who cannot predict 5 is 5.0 (False) or [1, 2] * 3 ([1, 2, 1, 2, 1, 2]) lose the type-operations round. Drill the list-comprehension practice page and the type-handling practice page.

How many PayPal practice problems should I solve before the interview?

Solve all 5 problems on the PayPal practice page end-to-end—untimed first, then timed at 25 minutes per problem. After that, broaden to 30 to 40 additional Python problems spread across the matching topic pages: list-comprehension, type-handling, hash-table, set, graph. The PayPal Python practice page and the PayPal medium practice page are the right surfaces for the final week of prep.


Start practicing PayPal data engineering problems
