PayPal data engineering interview questions are Python end-to-end with a payments-analytics edge: five Python primitives (list comprehensions for filter-then-map array transformations, operator semantics across int / str / list / tuple for type-operation outputs, defaultdict(int) aggregation for per-key sales rollups, set algebra (&, |, -) for recommendation engines that surface shared favorites, and BFS-based 2-coloring on adjacency dicts for bipartite seating-arrangement validation). The framings are merchant and platform data—catering events, restaurant recommendations, seat assignments, and the everyday transformations a data engineer applies to per-row payment and merchant feeds.
This guide walks through the five topic clusters PayPal actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 5-problem PayPal set (3 easy, 2 medium, no hard)—a Python-fundamentals hub that rewards comprehension fluency, type-quirk awareness, and clean graph reasoning over algorithm-puzzle prep. There is no SQL in this loop; every minute of preparation should go to Python.
Top PayPal data engineering interview topics
From the PayPal data engineering practice set, the five numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–5) | Why it shows up at PayPal |
|---|---|---|
| 1 | Python list comprehensions for array transformations | List Comprehension Conversion—rewrite a for / append / if loop as [expr for x in iter if cond] in one line. |
| 2 | Python type operations and operator semantics | Python Type Operations Output—predict + * == is results across int / str / list / tuple, including small-int caching and sequence concat. |
| 3 | Hash tables and defaultdict aggregation for sales reports | Catering Sales Report—defaultdict(int) per-event totals, then sorted(..., key=lambda kv: -kv[1]) for ranked output. |
| 4 | Set intersection for recommendation engines | Restaurant Recommendations—per-user set of liked places, intersect to surface shared favorites, difference to filter "new only." |
| 5 | Graphs and sets for bipartite seating validation | Valid Seating Arrangement—adjacency dict from a list of conflicts, BFS 2-coloring, same-color edge means not bipartite. |
Payments-and-platform framing rule: PayPal's prompts span everyday data-engineering Python on payments and merchant data—filter-map array transforms, type-aware parsing of mixed inputs, per-key aggregation for revenue rollups, set-based recommendations, and graph integrity checks. The interviewer is grading whether you map each business framing to the right Python primitive: filter-then-map → list comprehension; type quirks → operator-semantic awareness; per-key totals → defaultdict(int); "users who like both" → set intersection; "valid pairing on a conflict graph" → BFS bipartite check. State the mapping out loud.
1. Python List Comprehensions for Array Transformations
List comprehensions for filter-then-map array transforms in Python for data engineering
"Convert this for / append / if loop into a list comprehension" is the canonical Python-fluency interview prompt at PayPal. The mental model: a comprehension [expr for x in iter if cond] is a one-line filter-then-map pipeline. Every imperative for loop that builds a list with append (with optional if filtering) collapses into one comprehension. PayPal grades whether you can rewrite the imperative loop without changing semantics—same input, same output, same edge cases.
Pro tip: A comprehension and an append loop have the same time complexity, but the comprehension is meaningfully faster (lower per-iteration overhead) and is what PayPal's interviewers expect to see when they ask "can you make this Pythonic?" Don't reach for filter() + map() + list()—that is a three-call chain that most reviewers will rewrite as a comprehension anyway.
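A quick way to see the gap yourself is a minimal timeit sketch; the workload below is made up for illustration, and absolute numbers vary by machine and Python version:

```python
import timeit

nums = list(range(1000))

def with_loop():
    out = []
    for x in nums:
        if x % 2 == 0:
            out.append(x * x)
    return out

def with_comprehension():
    return [x * x for x in nums if x % 2 == 0]

# Same O(N) work, but the comprehension skips the repeated
# out.append attribute lookup and call on every iteration.
print("loop:         ", timeit.timeit(with_loop, number=1_000))
print("comprehension:", timeit.timeit(with_comprehension, number=1_000))
```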
Comprehension syntax: [expr for x in iter if cond]
The comprehension invariant: for each x in iter, if cond is truthy, evaluate expr and collect the result. Reading order: expr, then for x in iter, then if cond. Execution order: scan iter, filter by cond, transform with expr.
- [expr for x in iter] — pure map; one output per input.
- [expr for x in iter if cond] — filter-then-map; outputs only where cond is truthy.
- {expr for x in iter} — set comprehension; deduplicates results.
- {k: v for k, v in iter} — dict comprehension.
- (expr for x in iter) — generator expression; lazy, memory-friendly for large inputs.
Worked example. Convert this loop into a comprehension.
```python
result = []
for x in [1, -2, 3, -4, 5]:
    if x > 0:
        result.append(x * x)
# result = [1, 9, 25]
```
Worked-example solution.
```python
result = [x * x for x in [1, -2, 3, -4, 5] if x > 0]
# [1, 9, 25]
```
Rule of thumb: if the body of the loop is one if followed by one append, it is a comprehension. If the body has any other statement (a print, a counter, an early break), keep the loop.
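As a counter-example, here is the kind of loop the rule says to keep: the side counter and the early break have no comprehension equivalent (the function and values here are hypothetical, for illustration only):

```python
def first_three_positives(nums):
    out = []
    skipped = 0  # side-effect counter: no place for this in a comprehension
    for x in nums:
        if x <= 0:
            skipped += 1
            continue
        out.append(x)
        if len(out) == 3:
            break  # early exit: also not expressible in a comprehension
    return out, skipped

print(first_three_positives([1, -2, 3, -4, 5, 6]))  # ([1, 3, 5], 2)
```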
Filter-map fusion vs separate filter + map calls
The fusion invariant: a comprehension does the filter and the map in one pass. Equivalent filter(...) and map(...) calls produce the same result but read worse and force an extra list(...) materialization in Python 3 (where both return iterators).
- [expr for x in iter if cond] — one pass, one allocation, idiomatic.
- list(map(lambda x: expr, filter(lambda x: cond, iter))) — three function calls; harder to read.
- Generator expression — sum(x*x for x in nums if x > 0) skips the intermediate list entirely.
- itertools chains — for streaming pipelines without materializing.
Worked example. Three equivalent ways to compute the squares of positive numbers.
| approach | code | output |
|---|---|---|
| comprehension | [x*x for x in nums if x > 0] | [1, 9, 25] |
| filter + map | list(map(lambda x: x*x, filter(lambda x: x>0, nums))) | [1, 9, 25] |
| generator (sum) | sum(x*x for x in nums if x > 0) | 35 |
Worked-example solution.
```python
nums = [1, -2, 3, -4, 5]

# Pythonic
squares_pos = [x * x for x in nums if x > 0]

# Equivalent but verbose
squares_pos_alt = list(map(lambda x: x * x, filter(lambda x: x > 0, nums)))

assert squares_pos == squares_pos_alt == [1, 9, 25]
```
Rule of thumb: prefer comprehensions; reach for filter / map only when you already have named functions to plug in (e.g. list(map(int, raw_strs))).
Nested comprehensions and conditional expressions inside expr
The composition invariant: two transformations can live inside the same comprehension—the expr slot can be a conditional expression, and a second for clause introduces nesting. Used carefully, this expresses cross products and label assignments in one line; used carelessly, it produces unreadable code.
- Conditional expression in expr — [x if x >= 0 else -x for x in nums] (clamp to absolute value).
- Two for clauses — [(a, b) for a in xs for b in ys] (Cartesian product; outer-loop-first reading order).
- Combined filter and conditional expression — [("pos" if x > 0 else "zero") for x in nums if x >= 0].
- Stop at one nest level — beyond two for clauses, switch to a regular loop or itertools.product.
Worked example. Replace negatives with their absolute value, then keep only values whose absolute is ≤ 4.
```python
nums = [-5, -3, 0, 2, 4, 7]
# Goal: [3, 0, 2, 4]
```
Worked-example solution.
```python
result = [
    abs(x)
    for x in [-5, -3, 0, 2, 4, 7]
    if abs(x) <= 4
]
# [3, 0, 2, 4]
```
Rule of thumb: the cond and expr slots can each call any function—including abs, int, and str.lower—so most "transform-then-filter-then-transform" pipelines are still one comprehension.
Common beginner mistakes
- Using filter() / map() then forgetting list(...) in Python 3—returns an iterator, not a list.
- Putting the if before the for—syntax error in a comprehension (the if is a filter clause, not a top-level conditional).
- Mutating the iterable inside the comprehension—undefined behavior; build into a fresh list instead.
- Writing a 3-level nested comprehension—correct but unreadable; use a regular loop.
- Forgetting that the leftmost for is the outermost loop in nested comprehensions—reverse the mental model and your output is transposed.
Python Interview Question on List Comprehension Conversion
Convert the loop below into a single list comprehension that produces the same output.
```python
def positive_doubles(nums):
    out = []
    for n in nums:
        if n > 0:
            out.append(n * 2)
    return out
```
The function takes a list of integers and returns a list of doubled values for positive entries only. Preserve the input order.
Solution Using a single [expr for x in iter if cond] comprehension
```python
def positive_doubles(nums):
    return [n * 2 for n in nums if n > 0]
```
Step-by-step trace (input: nums = [3, -1, 4, 0, -7, 2]):
| index | n | n > 0? | n * 2 | output so far |
|---|---|---|---|---|
| 0 | 3 | yes | 6 | [6] |
| 1 | -1 | no | — | [6] |
| 2 | 4 | yes | 8 | [6, 8] |
| 3 | 0 | no | — | [6, 8] |
| 4 | -7 | no | — | [6, 8] |
| 5 | 2 | yes | 4 | [6, 8, 4] |
- Iterate nums in order; the comprehension is left-to-right.
- Filter if n > 0 — 0 and negatives drop; 3, 4, 2 pass.
- Map n * 2 — emit 6, 8, 4 in that order.
- Collect into a list — return [6, 8, 4].
Output:
| return value |
|---|
| [6, 8, 4] |
Why this works — concept by concept:
- for n in nums — iterates the input in order; comprehensions preserve input order, so the output is order-stable.
- if n > 0 filter clause — drops every value that doesn't satisfy the predicate before the expression runs; 0 and negatives never reach n * 2.
- n * 2 expression — the only transformation applied to surviving values; lives in the expr slot of the comprehension.
- One-pass evaluation — the comprehension iterates once, filters, and emits; no intermediate list of "all values" is built.
- List allocation — Python pre-sizes the result list when it can predict the length, otherwise grows it amortized; either way the comprehension is faster than append in a loop.
- Cost — O(N) time over the input length; O(K) space where K is the number of positives.
2. Python Type Operations and Operator Semantics
Operator semantics across Python types for data engineering
"Predict the output of these operator expressions" is PayPal's signature Python-fluency screen. The mental model: +, *, ==, and is mean different things depending on operand types. + is numeric add for ints, sequence concat for strings/lists/tuples. * is numeric multiply for ints, sequence-replicate when one side is an int. == is value equality for everything. is is identity—only true when both names point at the same object in memory. Knowing how each operator dispatches is the difference between confidently predicting [1, 2] * 3 == [1, 2, 1, 2, 1, 2] and guessing.
Pro tip: Python caches small ints (typically -5 to 256), so a is b returns True for a = 256; b = 256 but False for a = 257; b = 257. Never use is for value comparison—use ==. Use is only for is None, is True, is False, and identity checks against sentinel objects.
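A minimal sketch of the cache boundary; `is` on ints is a CPython implementation detail, so treat the outputs below as typical rather than guaranteed:

```python
a = 256
b = 256
print(a is b)  # True: 256 sits inside CPython's small-int cache (-5..256)

c = 257
d = 257
# Typically False when typed as separate statements at the REPL; may be
# True inside a single compiled file because of constant folding.
print(c is d)

print(c == d)        # True: value equality is what you actually want
print(None is None)  # True: None is a singleton, so `is None` is always safe
```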
+ overloading: numeric add vs sequence concat
The + invariant: for numerics it adds; for sequences it concatenates; mixed types raise TypeError. Dispatch starts with the left operand's __add__ method, falling back to the right operand's __radd__ if the left side returns NotImplemented.
- int + int — 5 + 3 == 8, numeric.
- str + str — "ab" + "cd" == "abcd", concat.
- list + list — [1, 2] + [3] == [1, 2, 3], concat.
- tuple + tuple — (1, 2) + (3,) == (1, 2, 3), concat (note the trailing comma for a one-tuple).
- int + str — TypeError; mixed types do not implicitly convert.
Worked example. Predict the output for each line.
| expression | result |
|---|---|
| 1 + 2 | 3 |
| "a" + "b" | "ab" |
| [1] + [2, 3] | [1, 2, 3] |
| (1,) + (2, 3) | (1, 2, 3) |
| "a" + 1 | TypeError |
Worked-example solution.
```python
print(1 + 2)          # 3
print("a" + "b")      # "ab"
print([1] + [2, 3])   # [1, 2, 3]
print((1,) + (2, 3))  # (1, 2, 3)

try:
    print("a" + 1)
except TypeError as e:
    print(f"TypeError: {e}")
# TypeError: can only concatenate str (not "int") to str
```
Rule of thumb: + always means "the same kind of thing on both sides"; mixing types means an explicit cast (str(1) or int("1")) before the operation.
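A short sketch of the explicit-cast fix for the mixed-type TypeError:

```python
n = 1
s = "a"

# str + int raises TypeError, so cast in whichever direction fits the goal:
print(s + str(n))     # "a1" -- cast the int to str for display
print(int("41") + n)  # 42   -- cast the str to int for arithmetic
print(f"{s}{n}")      # "a1" -- f-strings apply the str() cast for you
```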
* semantics: numeric multiply vs sequence-replicate
The * invariant: for numerics it multiplies; for sequence × int it replicates the sequence that many times; sequence × sequence raises TypeError. The replicate count must be a non-negative integer; zero or negative gives an empty sequence.
- int * int — 3 * 4 == 12, numeric.
- "ab" * 3 — "ababab", replicate.
- [1, 2] * 3 — [1, 2, 1, 2, 1, 2], replicate.
- [0] * 5 — [0, 0, 0, 0, 0], common idiom for a fixed-length zero list.
- [[]] * 3 — [[], [], []], but all three inner lists are the same object; mutating one mutates all (gotcha).
Worked example. Predict each line.
| expression | result |
|---|---|
| 3 * 4 | 12 |
| "ab" * 3 | "ababab" |
| [1, 2] * 3 | [1, 2, 1, 2, 1, 2] |
| [0] * 5 | [0, 0, 0, 0, 0] |
| [1, 2] * 0 | [] |
| [1, 2] * -3 | [] (treated as zero) |
Worked-example solution.
print(3 * 4) # 12
print("ab" * 3) # "ababab"
print([1, 2] * 3) # [1, 2, 1, 2, 1, 2]
print([0] * 5) # [0, 0, 0, 0, 0]
print([1, 2] * 0) # []
# Aliasing gotcha
grid = [[]] * 3
grid[0].append("hi")
print(grid) # [['hi'], ['hi'], ['hi']] — all share the same inner list!
# Correct way to make 3 independent inner lists
grid_safe = [[] for _ in range(3)]
grid_safe[0].append("hi")
print(grid_safe) # [['hi'], [], []]
Rule of thumb: [X] * N is fine when X is immutable (int, str, tuple); for mutable inner objects use [copy.copy(X) for _ in range(N)] or a comprehension.
== value equality vs is identity (and small-int caching)
The equality invariant: == calls __eq__ and tests value; is tests whether two names refer to the exact same object. They usually agree for small ints and short strings (because of caching) and usually disagree for newly-constructed mutables.
- 5 == 5 — True (value).
- 5 is 5 — True for small ints (cache); avoid relying on it.
- [1, 2] == [1, 2] — True (value).
- [1, 2] is [1, 2] — False (two distinct list objects).
- x is None — the canonical pattern; None is a singleton.
Worked example. Predict each.
| expression | result | reason |
|---|---|---|
| 5 == 5 | True | value |
| 5 is 5 | True | small-int cache |
| 1000 is 1000 | usually True (literals in the same compile unit) | compiler folding |
| a = 1000; b = 1000; a is b | usually False (in the CPython REPL) | distinct allocations outside the cache |
| [1, 2] == [1, 2] | True | value |
| [1, 2] is [1, 2] | False | distinct lists |
Worked-example solution.
print(5 == 5) # True
print(5 is 5) # True (small-int cache)
a = 1000
b = 1000
print(a == b) # True
print(a is b) # implementation-dependent: usually False outside the REPL
print([1, 2] == [1, 2]) # True
print([1, 2] is [1, 2]) # False — two distinct list objects
x = None
print(x is None) # True — the idiomatic check
Rule of thumb: use == for "same value"; use is only for is None, is True, is False, and identity comparisons against sentinel singletons.
Common beginner mistakes
- Using is for value equality—works for small ints, breaks elsewhere; always use ==.
- Assuming [1, 2] + 3 works—it doesn't; TypeError.
- Building [[]] * N and mutating inner lists—aliasing bug.
- Expecting "abc" * 1.5 to work—TypeError; replicate counts must be int.
- Confusing string-concat speed: accumulating with s = s + part in a loop is O(N²); use "".join(parts) for O(N) (see the sketch below).
Python Interview Question on Type Operations Output
Predict the output for each statement below in order. Then return them as a list of strings ("True", "False", "TypeError", or the printed value).
```python
1 + 2
"a" + "b" * 3
[1] + [2, 3]
[1, 2] * 3
[1, 2] == [1, 2]
[1, 2] is [1, 2]
None is None
5 == 5.0
5 is 5.0
"" * 100
```
Solution Using a manual evaluation that respects operator precedence
```python
results = [
    "3",                   # 1 + 2
    "abbb",                # "a" + ("b" * 3) — * binds tighter than +
    "[1, 2, 3]",           # [1] + [2, 3]
    "[1, 2, 1, 2, 1, 2]",  # [1, 2] * 3
    "True",                # value equality on lists
    "False",               # distinct list objects
    "True",                # None is a singleton
    "True",                # 5 == 5.0 — Python compares numeric values across types
    "False",               # 5 is 5.0 — different types, different objects
    "''",                  # empty string * 100 = empty string
]
```
Step-by-step trace:
- 1 + 2 — int + int → 3.
- "a" + "b" * 3 — * has higher precedence than +, so this is "a" + ("b" * 3) = "a" + "bbb" = "abbb".
- [1] + [2, 3] — list concat → [1, 2, 3].
- [1, 2] * 3 — list replicate → [1, 2, 1, 2, 1, 2].
- [1, 2] == [1, 2] — value equality on lists; both have the same contents → True.
- [1, 2] is [1, 2] — identity test; the two literal lists are distinct objects → False.
- None is None — None is a singleton; both names point at the same object → True.
- 5 == 5.0 — Python compares numeric values across int and float → True.
- 5 is 5.0 — 5 is an int object; 5.0 is a float object; different types → False.
- "" * 100 — replicating an empty string any number of times yields the empty string → "".
Output:
| line | result |
|---|---|
| 1 | 3 |
| 2 | "abbb" |
| 3 | [1, 2, 3] |
| 4 | [1, 2, 1, 2, 1, 2] |
| 5 | True |
| 6 | False |
| 7 | True |
| 8 | True |
| 9 | False |
| 10 | "" |
Why this works — concept by concept:
- Operator precedence — * binds tighter than +, so "a" + "b" * 3 reads as "a" + ("b" * 3) and the result is "abbb", not "ababab".
- Polymorphic + — int + int adds, sequence + sequence concatenates; the dispatch is based on the operand types.
- Polymorphic * — int * int multiplies; sequence * int replicates.
- List value equality — == on lists compares element-by-element, so [1, 2] == [1, 2] is True even when the two lists are distinct objects.
- List identity — is tests whether two names point at the same object; two list literals on different lines are different objects, so is is False.
- None singleton — there is only one None in a Python program; is None is the canonical null check.
- Cross-type numeric == — Python upcasts 5 to 5.0 for the comparison, returning True; is does no such conversion and returns False because the types differ.
- Empty-string replicate — "" * N is empty for any N; multiplying nothing by any count is still nothing.
- Cost — every operation is O(1) in the size of these literals; the trace itself is constant-time.
3. Hash Tables and defaultdict Aggregation for Sales Reports
defaultdict aggregation for per-key totals in Python for data engineering
"Build a sales report grouped by event type" is the canonical hash-table aggregation prompt for PayPal's catering / merchant flows. The mental model: defaultdict(int) lets you do counts[key] += amount without checking for key existence; the dict accumulates per-key totals in one pass; a final sorted(...) step ranks the output. This is the everyday DE pattern for any per-entity rollup—per-merchant fees, per-category revenue, per-cohort signups.
Pro tip: If the value field is an int total, defaultdict(int) is right; if you need a sum and a count, defaultdict(lambda: [0, 0]) (or a small dataclass) is cleaner than two parallel dicts. Use Counter only when the input is a flat iterable of keys—not when each row carries a separate amount.
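A minimal sketch of the sum-and-count variant the tip describes, using defaultdict(lambda: [0, 0]); the rows are hypothetical:

```python
from collections import defaultdict

rows = [('wedding', 200), ('birthday', 80), ('wedding', 150)]

# acc[key] is a two-slot list: [running_sum, running_count]
acc = defaultdict(lambda: [0, 0])
for event, amount in rows:
    acc[event][0] += amount
    acc[event][1] += 1

averages = {event: total / count for event, (total, count) in acc.items()}
print(averages)  # {'wedding': 175.0, 'birthday': 80.0}
```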
defaultdict(int) for per-key totals
The defaultdict invariant: defaultdict(factory) returns the factory's value when a key is missing. With factory int, missing keys default to 0, so += always works without a KeyError. The structure: event_type → running total.
- from collections import defaultdict — the import.
- totals = defaultdict(int) — declare with factory int (returns 0 on a missing key).
- totals[k] += amount — accumulate; a missing key auto-initializes to 0.
- dict(totals) — convert to a plain dict for return / serialization (optional but idiomatic).
Worked example. Three catering rows: ('wedding', 200), ('birthday', 80), ('wedding', 150).
| step | row | totals before | action | totals after |
|---|---|---|---|---|
| 1 | ('wedding', 200) | {} | totals['wedding'] += 200 | {'wedding': 200} |
| 2 | ('birthday', 80) | {'wedding': 200} | totals['birthday'] += 80 | {'wedding': 200, 'birthday': 80} |
| 3 | ('wedding', 150) | {'wedding': 200, 'birthday': 80} | totals['wedding'] += 150 | {'wedding': 350, 'birthday': 80} |
Worked-example solution.
```python
from collections import defaultdict

rows = [('wedding', 200), ('birthday', 80), ('wedding', 150)]

totals = defaultdict(int)
for event, amount in rows:
    totals[event] += amount

print(dict(totals))
# {'wedding': 350, 'birthday': 80}
```
Rule of thumb: if your loop body is if k not in d: d[k] = 0; d[k] += v, that is exactly what defaultdict(int) automates—replace it.
dict.items() + sorted(..., key=lambda kv: -kv[1]) for ranked output
The ranking invariant: sorted returns a new list of (key, value) tuples ordered by the key function. For descending revenue, sort by negative amount; alternatively, pass reverse=True.
- sorted(d.items(), key=lambda kv: -kv[1]) — descending by value (negate).
- sorted(d.items(), key=lambda kv: kv[1], reverse=True) — equivalent; arguably clearer.
- Tie-break by key — sorted(d.items(), key=lambda kv: (-kv[1], kv[0])) (descending value, ascending key on ties).
- Top-N — slice with [:N] after sorting, or use heapq.nlargest(N, d.items(), key=lambda kv: kv[1]) for O(M log N); see the sketch below.
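The two top-N routes from the last bullet, side by side; the totals are the same hypothetical numbers used above:

```python
import heapq

totals = {'wedding': 350, 'birthday': 80, 'corporate': 300}

# Route 1: full sort, then slice -- O(M log M) over all M keys.
top2_sorted = sorted(totals.items(), key=lambda kv: -kv[1])[:2]

# Route 2: heap-based -- O(M log N), better when N is much smaller than M.
top2_heap = heapq.nlargest(2, totals.items(), key=lambda kv: kv[1])

assert top2_sorted == top2_heap == [('wedding', 350), ('corporate', 300)]
```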
Worked example. Continuing the previous totals.
| sort key | output |
|---|---|
| key=lambda kv: -kv[1] | [('wedding', 350), ('birthday', 80)] |
| key=lambda kv: kv[1], reverse=True | [('wedding', 350), ('birthday', 80)] |
| key=lambda kv: (-kv[1], kv[0]) | [('wedding', 350), ('birthday', 80)] |
Worked-example solution.
```python
ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
# [('wedding', 350), ('birthday', 80)]

# Top 1 (most-revenue event)
print(ranked[0])
# ('wedding', 350)
```
Rule of thumb: always add a tie-break key; Python's sort is stable, so without one the order of tied entries silently follows dict insertion order, and tests flake.
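A small demonstration of why the tie-break matters; with equal totals, the untied sort's order simply follows dict insertion order (hypothetical totals):

```python
totals = {'birthday': 300, 'corporate': 300}

# No tie-break: the two 300s keep insertion order, which changes whenever
# upstream code happens to build the dict differently.
print(sorted(totals.items(), key=lambda kv: -kv[1]))

# Tie-break on the key: deterministic regardless of insertion order.
print(sorted(totals.items(), key=lambda kv: (-kv[1], kv[0])))
# [('birthday', 300), ('corporate', 300)]
```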
Multi-field aggregation: nested dicts vs (key1, key2) tuple keys
The composition invariant: for two-dimensional aggregates (event × city), use a tuple key (event, city) or a nested dict {event: {city: total}}. Tuple keys read more cleanly when both fields are always present; nested dicts let you ask "all cities for this event" in one lookup.
- Tuple key — totals[(event, city)] += amount.
- Nested dict — totals[event][city] += amount; needs defaultdict(lambda: defaultdict(int)).
- Pandas alternative — df.groupby(['event', 'city'])['amount'].sum(); same idea, vectorized (see the sketch below).
- Choose tuple keys for serialization to JSON; choose nested dicts for "per-event sub-rollup" queries.
Worked example. Aggregate by (event, city).
```python
rows = [
    ('wedding', 'NY', 200),
    ('birthday', 'SF', 80),
    ('wedding', 'NY', 150),
    ('wedding', 'SF', 100),
]
```
Expected:
| (event, city) | total |
|---|---|
| (wedding, NY) | 350 |
| (birthday, SF) | 80 |
| (wedding, SF) | 100 |
Worked-example solution.
```python
from collections import defaultdict

totals = defaultdict(int)
for event, city, amount in rows:
    totals[(event, city)] += amount

for k, v in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0])):
    print(k, v)
# ('wedding', 'NY') 350
# ('wedding', 'SF') 100
# ('birthday', 'SF') 80
```
Rule of thumb: tuple keys are fine until you need "all cities for one event"; at that point, switch to nested dicts.
Common beginner mistakes
- Using dict[k] += v without defaultdict — KeyError on the first row per key.
- Using Counter when each row carries a non-uniform amount — Counter counts occurrences; it does not sum weighted amounts.
- Sorting by -kv[1] with string values — TypeError on negation; use reverse=True.
- Forgetting the dict() cast when serializing to JSON — a defaultdict serializes, but the type round-trips weirdly.
- Mixing tuple-key and nested-dict aggregations in the same pipeline — pick one and stick with it.
Python Interview Question on a Catering Sales Report
You are given a list of catering sales rows of the form (event_type: str, amount: int). Return a list of (event_type, total) tuples sorted by total descending, then by event_type ascending on ties.
```python
def catering_sales_report(rows):
    ...
```
Solution Using defaultdict(int) and a tuple-keyed sorted(...)
```python
from collections import defaultdict

def catering_sales_report(rows):
    totals = defaultdict(int)
    for event, amount in rows:
        totals[event] += amount
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
```
Step-by-step trace (input: 6 rows below):
| event_type | amount |
|---|---|
| wedding | 200 |
| birthday | 80 |
| wedding | 150 |
| corporate | 300 |
| birthday | 40 |
| wedding | 50 |
- Initialize — totals = defaultdict(int) (empty, default factory int).
- Row 1 — totals['wedding'] += 200 → {'wedding': 200}.
- Row 2 — totals['birthday'] += 80 → {'wedding': 200, 'birthday': 80}.
- Row 3 — totals['wedding'] += 150 → {'wedding': 350, 'birthday': 80}.
- Row 4 — totals['corporate'] += 300 → {'wedding': 350, 'birthday': 80, 'corporate': 300}.
- Row 5 — totals['birthday'] += 40 → {'wedding': 350, 'birthday': 120, 'corporate': 300}.
- Row 6 — totals['wedding'] += 50 → {'wedding': 400, 'birthday': 120, 'corporate': 300}.
- sorted(totals.items(), key=lambda kv: (-kv[1], kv[0])) — descending by total, ascending by event on ties: [('wedding', 400), ('corporate', 300), ('birthday', 120)].
Output:
| event_type | total |
|---|---|
| wedding | 400 |
| corporate | 300 |
| birthday | 120 |
Why this works — concept by concept:
- defaultdict(int) — supplies a 0 baseline for any missing key, so totals[event] += amount works on the first occurrence without a manual existence check.
- += accumulation — adds each row's amount onto the running total, correctly handling repeated keys.
- dict.items() — produces the (key, value) pairs we feed into sorted.
- Tuple sort key — (-kv[1], kv[0]) sorts primarily by descending total (negate the amount) and breaks ties by ascending event name; deterministic.
- One-pass + one-sort — aggregation is one linear pass over the rows; the sort is O(K log K) where K is the number of distinct event types, typically tiny.
- O(N + K log K) total cost — N rows aggregated, K distinct keys sorted; in practice the sort is negligible.
4. Set Intersection for Recommendation Engines
Set algebra for recommendation queries in Python for data engineering
"Find restaurants both users like" is the canonical set-intersection interview prompt. The mental model: build a set of liked items per user; intersect (&) the sets to surface shared favorites; union (|) for combined preferences; difference (-) for "user A's picks that user B has not seen". Set algebra is O(min(len(a), len(b))) for intersection, which is why every recommendation pipeline reaches for sets before it reaches for nested loops.
Pro tip: When you have N user sets and need the items everyone likes, use reduce(set.intersection, sets) or set.intersection(*sets)—both fold across all the sets in one expression. Don't manually loop with result &= s; the reduce form is one line and idiomatic.
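A minimal sketch of the N-way fold, with an explicit guard for the zero-sets edge case (the user sets are hypothetical):

```python
from functools import reduce

sets = [
    {'pasta_palace', 'sushi_world'},
    {'pasta_palace', 'tacos_y_tu'},
    {'pasta_palace', 'kebab_king'},
]

# Both forms fold the intersection across every set in the list.
common = reduce(set.intersection, sets)
common_alt = set.intersection(*sets)
assert common == common_alt == {'pasta_palace'}

# Edge case: set.intersection() with no arguments raises TypeError,
# so guard the empty-list case explicitly.
maybe_empty = []
common_safe = set.intersection(*maybe_empty) if maybe_empty else set()
print(common_safe)  # set()
```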
Building per-user sets from interaction lists
The construction invariant: a per-user set is a deduplicated collection of items that user has interacted with. From a list of (user, item) rows, group by user and collect items into a set.
- set(items) — wrap a list to get a deduplicated set.
- defaultdict(set) — accumulate per-user sets in one pass.
- liked[user].add(item) — add-or-create.
- Set membership — item in liked[user] is O(1).
Worked example. Build liked from interaction rows.
```python
rows = [
    ('alice', 'pasta_palace'),
    ('alice', 'sushi_world'),
    ('bob', 'pasta_palace'),
    ('bob', 'tacos_y_tu'),
    ('alice', 'pasta_palace'),  # duplicate; the set dedups it
]
```
Expected liked:
| user | set |
|---|---|
| alice | {'pasta_palace', 'sushi_world'} |
| bob | {'pasta_palace', 'tacos_y_tu'} |
Worked-example solution.
```python
from collections import defaultdict

liked = defaultdict(set)
for user, item in rows:
    liked[user].add(item)

print(dict(liked))
# {'alice': {'pasta_palace', 'sushi_world'}, 'bob': {'pasta_palace', 'tacos_y_tu'}}
```
Rule of thumb: defaultdict(set) is to per-key collections what defaultdict(int) is to per-key totals.
set_a & set_b intersection for shared favorites
The intersection invariant: a & b returns a new set containing only the items in both a and b. Time is O(min(len(a), len(b)))—the implementation iterates the smaller side and probes the larger.
- a & b — operator form.
- a.intersection(b) — method form; accepts any iterable.
- a & b & c — chained intersection across more than two sets.
- a.intersection(b, c, d) — varargs form for many sets.
Worked example. Intersect Alice's and Bob's likes.
| set | items |
|---|---|
| liked['alice'] | {'pasta_palace', 'sushi_world'} |
| liked['bob'] | {'pasta_palace', 'tacos_y_tu'} |
| liked['alice'] & liked['bob'] | {'pasta_palace'} |
Worked-example solution.
```python
shared = liked['alice'] & liked['bob']
print(shared)       # {'pasta_palace'}
print(len(shared))  # 1
```
Rule of thumb: if the question says "both" or "all of," reach for &; if it says "either" or "any of," reach for |.
Multi-set rollup: reduce(set.intersection, sets) and "new only" via difference
The composition invariant: N-way intersections fold cleanly with reduce; "new to user X" is set difference. The difference operation a - b returns items in a but not in b.
- reduce(set.intersection, [a, b, c]) — items in all three sets.
- set.intersection(*[a, b, c]) — equivalent; one call.
- a - b — items in a but not in b.
- a ^ b — symmetric difference (in either, but not both).
Worked example. Three users; find what all three like, plus "new to alice" given the group.
| set | items |
|---|---|
| alice | {'pasta_palace', 'sushi_world'} |
| bob | {'pasta_palace', 'tacos_y_tu'} |
| carol | {'pasta_palace', 'sushi_world', 'kebab_king'} |

| query | result |
|---|---|
| set.intersection(*sets) | {'pasta_palace'} |
| (alice \| bob \| carol) - alice | {'tacos_y_tu', 'kebab_king'} |
Worked-example solution.
```python
from functools import reduce

sets = [liked['alice'], liked['bob'], liked['carol']]
common_to_all = reduce(set.intersection, sets)
print(common_to_all)  # {'pasta_palace'}

group_pool = liked['alice'] | liked['bob'] | liked['carol']
new_for_alice = group_pool - liked['alice']
print(new_for_alice)  # {'tacos_y_tu', 'kebab_king'}
```
Rule of thumb: most "recommend something" questions reduce to: take the union of the group, subtract the target user's already-known set, optionally rank by frequency.
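That three-step recipe as a sketch: union the group, subtract the target user's set, then rank the survivors by how many group members liked them (the data is hypothetical):

```python
from collections import Counter

liked = {
    'alice': {'pasta_palace', 'sushi_world'},
    'bob':   {'pasta_palace', 'tacos_y_tu', 'kebab_king'},
    'carol': {'pasta_palace', 'sushi_world', 'kebab_king'},
}

target = 'alice'
candidates = set().union(*liked.values()) - liked[target]

# Rank each candidate by how many users in the group liked it.
counts = Counter(item for s in liked.values() for item in s)
ranked = sorted(candidates, key=lambda r: (-counts[r], r))
print(ranked)  # ['kebab_king', 'tacos_y_tu']
```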
Common beginner mistakes
- Using list(set(...)) and assuming order is preserved — sets are unordered.
- Using a for loop with if x in seen instead of set.intersection — correct but slow.
- Building a list of (user, item) tuples and manually deduping — use set from the start.
- Forgetting that the empty set is set(), not {} — {} is an empty dict.
- Using & on a list and a set — TypeError; convert the list to a set first.
Python Interview Question on Restaurant Recommendations
You are given a list of (user, restaurant) likes and two user IDs u1 and u2. Return a sorted list of restaurants that both users have liked. If either user has no likes, return an empty list.
```python
def shared_restaurants(rows, u1, u2):
    ...
```
Solution Using a defaultdict(set) index and a single intersection
```python
from collections import defaultdict

def shared_restaurants(rows, u1, u2):
    liked = defaultdict(set)
    for user, restaurant in rows:
        liked[user].add(restaurant)
    return sorted(liked[u1] & liked[u2])
```
Step-by-step trace (input: 7 likes across 3 users, query u1='alice', u2='bob'):
| user | restaurant |
|---|---|
| alice | pasta_palace |
| alice | sushi_world |
| alice | kebab_king |
| bob | pasta_palace |
| bob | kebab_king |
| bob | tacos_y_tu |
| carol | pasta_palace |
- Initialize — liked = defaultdict(set).
- Iterate rows — accumulate per-user sets:
  - liked['alice'] = {'pasta_palace', 'sushi_world', 'kebab_king'}
  - liked['bob'] = {'pasta_palace', 'kebab_king', 'tacos_y_tu'}
  - liked['carol'] = {'pasta_palace'}
- Intersect — liked['alice'] & liked['bob'] = {'pasta_palace', 'kebab_king'}.
- sorted(...) — ['kebab_king', 'pasta_palace'] (alphabetic ascending).
Output:
| return value |
|---|
| ['kebab_king', 'pasta_palace'] |
Why this works — concept by concept:
- defaultdict(set) — per-user sets accumulate without KeyError; add is the dedup-aware append.
- set.add — O(1) amortized; duplicates within the input are ignored automatically.
- liked[u1] & liked[u2] — Python's set operator; iterates the smaller side and probes the larger; O(min(|s1|, |s2|)).
- sorted(...) — converts the unordered set to a deterministic list; the prompt asks for sorted output.
- Single-pass aggregation — one walk over the input builds the per-user index; the intersection then runs against finished sets.
- Cost — O(R + min(|s1|, |s2|) + K log K) time, where R is total interaction rows and K = |s1 ∩ s2| for the final sort; O(R) space for the index.
5. Graphs and Sets for Bipartite Seating Validation
BFS 2-coloring on a conflict graph in Python for data engineering
"Given a list of conflict pairs, decide if the guests can be split into two tables with no conflicts at the same table" is the canonical bipartite-graph interview prompt. The mental model: build an adjacency dict from the pair list, BFS from each unvisited node, alternately color neighbors; if any edge ends up with both endpoints sharing a color, the graph is not bipartite. The same pattern decides every "two-team / two-shift / two-region" question whose constraint is "these pairs must be on opposite sides."
Pro tip: Always loop over every node in the graph, not just nodes[0]. The conflict graph may be disconnected (multiple isolated components), and a single BFS from one node only colors that component. The outer for node in graph: if node not in colored: ... loop is the canonical pattern.
Adjacency dict from a list of pair conflicts
The adjacency invariant: for an undirected conflict graph, every pair (u, v) adds v to adj[u] and u to adj[v]. The dict-of-sets form is the canonical Python structure—O(1) neighbor lookup, automatic dedup of duplicate edges.
- adj = defaultdict(set) — per-node neighbor set.
- adj[u].add(v); adj[v].add(u) — symmetric add for an undirected graph.
- adj[u].add(v) only — a directed graph (different problem).
- Use a set, not a list — duplicate edges dedup automatically and lookups are faster.
Worked example. Pairs [(1, 2), (2, 3), (3, 1)] → triangle.
| node | neighbors |
|---|---|
| 1 | {2, 3} |
| 2 | {1, 3} |
| 3 | {1, 2} |
Worked-example solution.
```python
from collections import defaultdict

def build_adj(pairs):
    adj = defaultdict(set)
    for u, v in pairs:
        adj[u].add(v)
        adj[v].add(u)
    return adj

build_adj([(1, 2), (2, 3), (3, 1)])
# defaultdict(set, {1: {2, 3}, 2: {1, 3}, 3: {1, 2}})
```
Rule of thumb: always build the adjacency structure first; then BFS / DFS becomes a clean separate phase.
BFS / DFS 2-coloring: assign color, push neighbors with the opposite color
The coloring invariant: start from an arbitrary unvisited node, color it 0, then BFS pushing each neighbor with the opposite color. The colors dict maps node → 0 or 1; 0 = "Table A," 1 = "Table B."
- from collections import deque — the BFS queue.
- q = deque([start]); colors[start] = 0 — seed.
- Pop, then iterate neighbors — for each neighbor nb: if uncolored, set colors[nb] = 1 - colors[node] and push it onto the queue.
- Outer loop over all nodes — for node in graph: if node not in colors: bfs(node) — handles disconnected components.
Worked example. A 4-node path graph 1 - 2 - 3 - 4. BFS from 1.
| step | queue | popped | colors after |
|---|---|---|---|
| 0 | [1] | — | {1: 0} |
| 1 | [2] | 1 | {1: 0, 2: 1} |
| 2 | [3] | 2 | {1: 0, 2: 1, 3: 0} |
| 3 | [4] | 3 | {1: 0, 2: 1, 3: 0, 4: 1} |
All edges connect 0 to 1—bipartite ✓.
Worked-example solution.
```python
from collections import deque, defaultdict

def two_color(adj, start, colors):
    q = deque([start])
    colors[start] = 0
    while q:
        node = q.popleft()
        for nb in adj[node]:
            if nb not in colors:
                colors[nb] = 1 - colors[node]
                q.append(nb)
            elif colors[nb] == colors[node]:
                return False  # conflict: both endpoints share a color
    return True
```
Rule of thumb: 1 - color is the clean "flip" idiom; don't reach for not color (which works on truthy values but is less explicit).
Conflict detection: same color on both ends of an edge → not bipartite
The validation invariant: during BFS, when you visit a neighbor that is already colored, check that its color is opposite; if it's the same, the graph is not bipartite. This is the only place the algorithm can return False.
- Check colors[nb] == colors[node] — same color on both endpoints of the edge.
- Return False immediately — short-circuit; no point in continuing.
- All neighbors checked + outer loop completed = True — every component validated.
- Self-loop edge (u, u) — automatically not bipartite (same color trivially).
Worked example. Triangle 1 - 2 - 3 - 1. BFS from 1.
| step | queue | popped | colors after | conflict? |
|---|---|---|---|---|
| 0 | [1] | — | {1: 0} | no |
| 1 | [2, 3] | 1 | {1: 0, 2: 1, 3: 1} | no |
| 2 | [3] | 2 | neighbor 3 already colored 1 = colors[2] | yes |
Triangle is not bipartite.
Worked-example solution.
```python
from collections import defaultdict  # two_color is defined in the previous snippet

def is_bipartite(pairs):
    adj = defaultdict(set)
    for u, v in pairs:
        adj[u].add(v)
        adj[v].add(u)
    colors = {}
    for node in list(adj):
        if node in colors:
            continue
        if not two_color(adj, node, colors):
            return False
    return True

print(is_bipartite([(1, 2), (2, 3), (3, 1)]))  # False (triangle)
print(is_bipartite([(1, 2), (2, 3), (3, 4)]))  # True (path)
```
Rule of thumb: odd-length cycles always break bipartite-ness; even-length cycles are fine. Triangles, pentagons, heptagons—doomed. Squares, hexagons—safe.
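A quick check of the odd-vs-even rule using the is_bipartite helper defined above; the 5-cycle fails, the 6-cycle passes:

```python
pentagon = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)]
hexagon = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]

print(is_bipartite(pentagon))  # False: odd cycle forces a same-color edge
print(is_bipartite(hexagon))   # True: even cycle alternates colors cleanly
```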
Common beginner mistakes
- BFS from only one node and assuming the graph is connected—miss other components.
- Using a list for the queue and
pop(0)—O(N)per pop instead ofO(1); usecollections.deque. - Coloring with
True / False—works butnot cflips them;0 / 1is explicit. - Forgetting the symmetric add
adj[v].add(u)—half the edges go missing. - Treating "no conflict at all" as "bipartite"—technically a graph with no edges is bipartite trivially, but most prompts want non-empty inputs.
Python Interview Question on Valid Seating Arrangement
You are given a list of guests and a list of "do not seat together" pairs. Decide whether the guests can be split into two tables such that no pair sits at the same table. Return True if a valid assignment exists, False otherwise.
```python
def valid_seating(guests, conflicts):
    ...
```
Solution Using BFS 2-coloring on the conflict graph
```python
from collections import defaultdict, deque

def valid_seating(guests, conflicts):
    adj = defaultdict(set)
    for u, v in conflicts:
        adj[u].add(v)
        adj[v].add(u)
    colors = {}
    for start in guests:
        if start in colors:
            continue
        colors[start] = 0
        q = deque([start])
        while q:
            node = q.popleft()
            for nb in adj[node]:
                if nb not in colors:
                    colors[nb] = 1 - colors[node]
                    q.append(nb)
                elif colors[nb] == colors[node]:
                    return False
    return True
```
Step-by-step trace (input: guests = [1, 2, 3, 4, 5], conflicts = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)] — a 5-cycle):
- Build adjacency:
| node | neighbors |
|-----:|-----------|
| 1 | {2, 5} |
| 2 | {1, 3} |
| 3 | {2, 4} |
| 4 | {3, 5} |
| 5 | {1, 4} |
- Outer loop — start at guest 1. colors = {1: 0}. Queue [1].
- Pop 1 — neighbors {2, 5}. Color 2 := 1, push. Color 5 := 1, push. colors = {1: 0, 2: 1, 5: 1}. Queue [2, 5].
- Pop 2 — neighbors {1, 3}. 1 already colored 0 ≠ 1. Color 3 := 0, push. colors = {1: 0, 2: 1, 5: 1, 3: 0}. Queue [5, 3].
- Pop 5 — neighbors {1, 4}. 1 already colored 0 ≠ 1. Color 4 := 0, push. colors = {1: 0, 2: 1, 5: 1, 3: 0, 4: 0}. Queue [3, 4].
- Pop 3 — neighbors {2, 4}. 2 already colored 1 ≠ 0. Check 4: already colored 0 = colors[3] = 0 → conflict. Return False.
The 5-cycle is not bipartite (odd-length cycle), so seating fails.
Output:
| return value |
|---|
| False |
Why this works — concept by concept:
- Adjacency dict — defaultdict(set) makes neighbor lookups O(1) and dedup-safe; the symmetric adj[u].add(v); adj[v].add(u) reflects the undirected-conflict semantics.
- BFS queue (deque) — O(1) popleft and append, the right structure for level-order traversal of the conflict graph.
- 2-coloring with 1 - color — flips between 0 and 1 cleanly; both colors are integers, so they are usable as dict values without surprises.
- Outer loop over all guests — handles disconnected components by re-seeding BFS from any uncolored node.
- Same-color edge check — the only failure mode; the moment a neighbor is already colored with the same value as the current node, the graph contains an odd-length cycle and bipartiteness is impossible.
- Early return — once we detect a conflict, return immediately; no point exploring the rest.
- Cost — O(V + E) time (each node and edge visited once), O(V) space for colors and the queue, where V = guests and E = conflict pairs.
Tips to Crack PayPal Data Engineering Interviews
Drill the five PayPal Python primitives, in order
The five Python primitives in the curated PayPal Python practice set are list comprehensions (#29), operator-semantic fluency (#30), defaultdict(int) aggregation (#98), set algebra (#289), and BFS 2-coloring on adjacency dicts (#290). Each maps to a specific module / idiom: comprehension syntax for #29, type-quirk awareness for #30, from collections import defaultdict for #98, set operators (&, |, -) for #289, from collections import deque, defaultdict for #290. Memorize the imports as muscle memory and the prompts become pattern recognition.
Drill operator semantics until they are reflexive
#30 Python Type Operations Output is the make-or-break easy round at PayPal. Candidates who guess on 5 is 5.0 (False) or [1, 2] * 3 ([1, 2, 1, 2, 1, 2]) lose the round. Drill the type-handling practice page until you can predict every operator output without thinking. Cover small-int caching (is quirks), sequence concat with +, sequence replicate with *, the + * == is matrix across int / str / list / tuple, and the aliasing gotcha on [[]] * N.
Treat the bipartite-seating problem as your medium-tier ceiling
#290 Valid Seating Arrangement is the hardest problem in the set. If you cannot fluently write the defaultdict(set) adjacency builder, the BFS 2-coloring loop, and the same-color conflict check, you cannot pass the PayPal medium round. Drill the graph practice page until the BFS template is muscle memory. The general "two-team / two-shift / two-region" pattern is the same algorithm.
SQL is not on this loop—skip it
The curated PayPal set is 5 Python, 0 SQL. Spending preparation hours on SQL window functions for a PayPal data-engineering interview is a misallocation. Spend it instead on Python comprehensions, operator semantics, defaultdict, set algebra, and graph BFS. If a SQL question appears, treat it as a different role family rather than the canonical loop covered in this guide.
Easy-tier discipline matters
Three of the five PayPal problems are EASY-tier. Easy at PayPal does not mean trivial; it means the interviewer expects zero hesitation, no off-by-ones, idiomatic Python, and an articulated invariant. A correct EASY answer with stuttering or one bug is graded worse than a correct MEDIUM answer with the same bug, because the EASY signal is supposed to be "this candidate writes Python fluently." The PayPal easy practice set is the right warmup.
Where to practice on PipeCode
Start with the PayPal practice page for the curated 5-problem set. After that, drill the matching topic pages: list-comprehension, type-handling, hash-table, set, graph. The interview courses page bundles the Python course if you want a structured curriculum before the company drills. For broad coverage, browse by topic.
Communication and approach under time pressure
Talk through the invariant first, the brute force second, and the optimal third—even if you go straight to the optimal in code. Interviewers grade process as much as the final answer. Leave 5 minutes at the end of each problem for an edge-case sweep: empty list, single element, all duplicates, isolated graph nodes, self-loops, mixed types. The most common "almost passed" failure mode is correct happy-path code that crashes on the empty input—a 30-second sweep prevents it.
Frequently Asked Questions
What is the PayPal data engineering interview process like?
PayPal's data engineering interview opens with a recruiter screen, then a technical phone screen with one Python coding problem, then an onsite (or virtual onsite) of four to five rounds: two Python coding rounds (comprehensions, type-operations, hash-table aggregation, set algebra, graph BFS), one data-modeling or system-design discussion (payment pipelines, merchant data, fraud signals), and one to two behavioral rounds. The coding rounds are heavy on the patterns covered in this guide.
What Python topics does PayPal test for data engineers?
PayPal Python interviews concentrate on five primitives that map directly to the curated 5-problem set: list comprehensions for filter-then-map array transforms (#29), operator semantics across int / str / list / tuple (#30), collections.defaultdict aggregation for per-key totals (#98), set algebra (&, |, -) for recommendations (#289), and BFS 2-coloring on adjacency dicts for bipartite validation (#290). Practice the imports as muscle memory: from collections import defaultdict, deque.
Does PayPal test SQL in data engineering interviews?
The curated PayPal practice set is 5 Python, 0 SQL—PayPal's data engineering interview is Python-only in the curated loop. If a SQL question appears, treat it as a different role family (analytics engineer, data scientist) rather than the canonical data-engineering loop covered here. Spending prep time on SQL window functions for a PayPal data-engineering interview is a misallocation; Python fluency is the entire signal.
How hard are PayPal data engineering interview questions?
The PayPal set is 3 easy + 2 medium + 0 hard. The EASY problems test list-comprehension fluency (#29), operator semantics (#30), and defaultdict aggregation (#98)—straightforward pattern recognition. The two MEDIUM problems are set-intersection recommendations (#289) and bipartite-graph BFS validation (#290); both are pattern-recognition once you have seen them. PayPal does not include a HARD-tier problem in this set, but seniority is graded on cleanliness and articulation, not just correctness.
Are list comprehensions and type-operations common in PayPal interviews?
Yes—both #29 List Comprehension Conversion and #30 Python Type Operations Output are explicit easy-tier prompts. Candidates who cannot fluently rewrite a for / append / if loop as [expr for x in iter if cond] lose the comprehension round; candidates who cannot predict 5 is 5.0 (False) or [1, 2] * 3 ([1, 2, 1, 2, 1, 2]) lose the type-operations round. Drill the list-comprehension practice page and the type-handling practice page.
How many PayPal practice problems should I solve before the interview?
Solve all 5 problems on the PayPal practice page end-to-end—untimed first, then timed at 25 minutes per problem. After that, broaden to 30 to 40 additional Python problems spread across the matching topic pages: list-comprehension, type-handling, hash-table, set, graph. The PayPal Python practice page and the PayPal medium practice page are the right surfaces for the final week of prep.


