Gowtham Potureddi
Databricks Data Engineering Interview Questions

Databricks data engineering interview questions are unusually algorithm-heavy for a DE loop. Expect mostly Python: array sorting and pairwise patterns, interval algorithms with sweep-line, hash tables for graph and counter state, binary search on sorted intervals, bit manipulation for CIDR / firewall design, sparse matrix representation, dynamic programming + sliding window + greedy combinations, and Morris constant-space binary-tree traversal. The framings are DE-flavored (upload byte counts, IP firewall rules, sparse matrix-vector multiplications) but the underlying patterns are general algorithm problems—not Spark API trivia.

This guide walks through the eight topic clusters Databricks actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 9-problem Databricks set (2 easy, 4 medium, 3 hard)—the hardest difficulty mix of any FAANG-adjacent DE company hub, so the article leans into the Hard tier rather than skipping it.

Databricks data engineering interview questions cover image with bold headline, Python chip, and pipecode.ai attribution.


Top Databricks data engineering interview topics

From the Databricks data engineering practice set, the eight numbered sections below follow this topic map (one row per H2):

| # | Topic (sections 1–8) | Why it shows up at Databricks |
| --- | --- | --- |
| 1 | Array sorting and pairwise patterns in Python | Pairwise ascending swap—tests fluency with sort and adjacent-pair iteration. |
| 2 | Intervals and sweep-line algorithms in Python | Consecutive upload byte count, lamps illuminating control points—endpoint events scanned left to right. |
| 3 | Hash tables for counting and graph state in Python | Unix command call counter—per-source counters and adjacency-list-style maps. |
| 4 | Binary search on intervals and sorted data in Python | Lamps illuminating control points—bisect_left to locate the first interval covering a query point. |
| 5 | Bit manipulation and CIDR / firewall design in Python | IP firewall CIDR rules—IPv4 as a 32-bit integer + prefix-mask matching. |
| 6 | Sparse matrix representation with hash tables in Python | Sparse matrix-vector multiplication—defaultdict(dict) over (i, j) keys for O(nnz) dot products. |
| 7 | Dynamic programming, sliding window, and greedy in Python | Maximum prefix removal operations—the Hard-tier multi-paradigm problem. |
| 8 | Binary tree traversal in constant space (Morris) in Python | Binary tree constant-space traversal—Morris in-order with threaded successors, O(1) extra space. |

Algorithmic mental model: when a problem mixes intervals, sortedness, or "queries on overlapping ranges," the answer is usually sort-then-sweep (sweep-line) or sort-then-bisect (binary search). When it's "count something per key over a stream," reach for Counter / defaultdict. When it's "find the optimal sequence of operations," check whether DP or greedy wins—often a hybrid of the two. State which family you're in before writing code.


1. Array Sorting and Pairwise Patterns in Python

Sorting and pairwise iteration in Python for data engineering

Python's list.sort() and sorted() are both TimSort: stable, O(n log n) worst case, O(n) on partially-sorted input. Beyond raw sort, many "make this array satisfy property X" prompts only need local neighbor relationships—pairwise iteration in O(n), not full sort in O(n log n). Read the prompt carefully and pick the cheaper algorithm.

Pro tip: Sorting gives you full ordering; pairwise gives you adjacent-pair relationships. The Databricks pairwise-swap problem (#6) tests whether you reach for the cheaper O(n) step-2 loop instead of arr.sort().

Stable sort and key= for custom orderings

list.sort(key=...) and sorted(iterable, key=...) accept a key function computed once per element; this is faster than the old Python 2 cmp= parameter (removed in Python 3) and supports composite keys via tuple returns. Stability lets you chain sorts—sort by the tie-break key first, then by the primary key—and the secondary order survives.

  • key=lambda x: (-x[0], x[1]) — primary descending via negation, secondary ascending; standard tuple-key idiom.
  • Stability — equal keys keep input order; built into TimSort, free.
  • cmp_to_key — slower; avoid unless you have a non-total-ordering quirk.
  • Worked example: sort [(2026, 'b'), (2025, 'a'), (2026, 'a'), (2025, 'b')] by (year, letter) → [(2025, 'a'), (2025, 'b'), (2026, 'a'), (2026, 'b')].
records = [(2026, 'b'), (2025, 'a'), (2026, 'a'), (2025, 'b')]
records.sort(key=lambda r: (r[0], r[1]))
# [(2025, 'a'), (2025, 'b'), (2026, 'a'), (2026, 'b')]
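Stability is what makes chained sorts work. A quick sketch of the chained-sort idiom on the same records: sort by the secondary key first, then by the primary key; equal primary keys keep the secondary order from the first pass.

```python
records = [(2026, 'b'), (2025, 'a'), (2026, 'a'), (2025, 'b')]

# Chained stable sorts: secondary key first, then primary key.
# TimSort's stability keeps the letter order within each year.
records.sort(key=lambda r: r[1])  # pass 1: letter (tie-break)
records.sort(key=lambda r: r[0])  # pass 2: year (primary)
# equivalent to records.sort(key=lambda r: (r[0], r[1]))
```

The two-pass form earns its keep when the composite ordering can't be expressed as a single tuple key (e.g. one key descending on strings).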

Pairwise swap to ascending order

Walk the array in non-overlapping pairs at indices (0,1), (2,3), … and swap each pair so the smaller element comes first. The result is "pairwise ascending" but not globally sorted. O(n) time, O(1) extra space—cheaper than a full sort by a factor of log n.

  • range(0, len(arr) - 1, 2) — step-2 index walk; the - 1 guards against odd length.
  • Invariant after one pass: every even index i has arr[i] <= arr[i+1].
  • Worked example: [3, 1, 4, 1, 5, 9, 2, 6] → process pairs (3,1), (4,1), (5,9), (2,6) → [1, 3, 1, 4, 5, 9, 2, 6].
def pairwise_ascending(arr: list[int]) -> list[int]:
    arr = arr[:]
    for i in range(0, len(arr) - 1, 2):
        if arr[i] > arr[i + 1]:
            arr[i], arr[i + 1] = arr[i + 1], arr[i]
    return arr

zip(arr, arr[1:]) for adjacent-pair iteration

For overlapping adjacent pairs (positions (0,1), (1,2), (2,3), …), the Pythonic idiom is zip(arr, arr[1:])—lazy, no extra allocation. itertools.pairwise(arr) (Python 3.10+) is the same thing, slightly faster.

  • zip(arr, arr[1:]) — overlapping adjacent pairs; total count is len(arr) - 1.
  • itertools.pairwise(arr) — same, clearer; use when 3.10+ is available.
  • Worked example: count strictly increasing pairs in [1, 3, 2, 5, 5] → (1,3) ✓, (3,2) ✗, (2,5) ✓, (5,5) ✗ → 2.
def count_strict_increases(arr: list[int]) -> int:
    return sum(1 for a, b in zip(arr, arr[1:]) if a < b)

Common beginner mistakes

  • Using range(len(arr)) and indexing arr[i+1] without bounding i to len(arr) - 1 (off-by-one IndexError on the last element).
  • Confusing non-overlapping pairs (step 2) with overlapping adjacent pairs (step 1)—they answer different questions.
  • Reaching for arr.sort() when only local pair ordering is needed (O(n log n) instead of O(n)).
  • Forgetting that list.sort() mutates in place and returns None; sorted(list) returns a new list and leaves the original alone.
  • Using cmp_to_key for orderings expressible as tuple keys—the key= lambda is faster and clearer.

Python interview question on array sorting and pairwise patterns

Given an integer array of even length n, rearrange it in place so that every adjacent pair (arr[2k], arr[2k+1]) is in non-decreasing order. Use O(n) time and O(1) extra space.

Solution using a single-pass swap with step 2

def pairwise_ascending(arr: list[int]) -> None:
    for i in range(0, len(arr), 2):
        if i + 1 < len(arr) and arr[i] > arr[i + 1]:
            arr[i], arr[i + 1] = arr[i + 1], arr[i]

Why this works: Each pair (arr[i], arr[i+1]) is independent of every other pair—swapping within one pair never changes another pair's contents. So a single pass that visits every even index and conditionally swaps is sufficient. Total time is O(n / 2) = O(n), extra space is O(1) (in-place swap), and the algorithm is faster than the O(n log n) full-sort approach by a factor of log n. The bound check i + 1 < len(arr) defends against odd-length input (where the last element has no pair partner).



2. Intervals and Sweep-Line Algorithms in Python

Intervals and sweep-line algorithms in Python for data engineering

Interval problems are everywhere in DE—session ranges, byte chunks, IP ranges, calendar bookings. The trick that turns nearly every interval problem from O(n²) into O(n log n) is sort the endpoints, then sweep left to right: convert each interval [start, end) into two events (+1 at start, -1 at end), sort by position, walk with a running counter that answers "how many intervals are active here?".

Pro tip: Always represent intervals as half-open [start, end)—matches Python slices, makes adjacent-interval merging clean ([1, 5) and [5, 9) don't overlap), and avoids boundary off-by-one bugs. State your tie-break order at coincident events out loud.

Half-open intervals and merge by sort

The half-open convention [start, end) composes cleanly: two adjacent half-open intervals [1, 5) and [5, 9) cover positions 1..8 with no overlap and no gaps. Merging follows a fixed three-step recipe in O(n log n) total (sort + sweep).

  • Step 1 — sort by start. Puts intervals in scan order.
  • Step 2 — sweep with a running merged interval. If next start < current end, extend; else emit and start new.
  • Step 3 — emit the final merged interval after the loop.
  • Worked example: merge [(1, 4), (3, 7), (8, 10), (9, 12)]:
| step | current merged | next | action | merged so far |
| --- | --- | --- | --- | --- |
| 0 | (1, 4) | — | initialize | — |
| 1 | (1, 4) | (3, 7) | overlap (3 < 4) → extend end | — |
| 2 | (1, 7) | (8, 10) | gap (8 >= 7) → emit, start new | [(1, 7)] |
| 3 | (8, 10) | (9, 12) | overlap (9 < 10) → extend end | [(1, 7)] |
| 4 | (8, 12) | — | end → emit | [(1, 7), (8, 12)] |
def merge_intervals(ivs: list[tuple[int, int]]) -> list[tuple[int, int]]:
    if not ivs:
        return []
    ivs = sorted(ivs)
    merged = [ivs[0]]
    for s, e in ivs[1:]:
        cs, ce = merged[-1]
        if s < ce:
            merged[-1] = (cs, max(ce, e))
        else:
            merged.append((s, e))
    return merged

Sweep-line: events at start/end, scan left to right

Decompose each interval into a +1 at start and -1 at end. Sort events; walk with a running counter. The counter at any moment equals the number of intervals active at the current position. O(n log n) for the sort, O(n) for the sweep.

  • +1 at start, -1 at end — the canonical event encoding.
  • Tie-break at coincident events — encode a kind field in each event tuple so sorting is deterministic at equal positions, e.g. events.sort(key=lambda e: (e[0], e[1])); choose +1 first or -1 first based on whether touching boundaries should count as overlapping.
  • Generalizations — weighted events, 2D rectangles, streaming variants.
  • Worked example: intervals [(1, 5), (3, 7), (6, 9)] → events [(1,+1), (3,+1), (5,-1), (6,+1), (7,-1), (9,-1)] → running counter 1, 2, 1, 2, 1, 0 → max overlap = 2.
def max_overlap(ivs: list[tuple[int, int]]) -> int:
    events: list[tuple[int, int]] = []
    for s, e in ivs:
        events.append((s, 1))
        events.append((e, -1))
    events.sort()
    best = active = 0
    for _, delta in events:
        active += delta
        if active > best:
            best = active
    return best
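To make the tie-break concrete, here is a sketch of the opposite convention (max_overlap_closed is an illustrative name): forcing +1 events to sort before -1 events at the same position treats touching boundaries as overlapping—closed-interval semantics.

```python
def max_overlap_closed(ivs: list[tuple[int, int]]) -> int:
    # Closed-boundary variant: at the same position, process +1 before -1,
    # so touching intervals like (1, 5) and (5, 9) DO count as overlapping.
    # Kind 0 (start) sorts before kind 1 (end) on position ties.
    events: list[tuple[int, int, int]] = []
    for s, e in ivs:
        events.append((s, 0, 1))   # start first on ties
        events.append((e, 1, -1))  # end second on ties
    events.sort()
    best = active = 0
    for _, _, delta in events:
        active += delta
        best = max(best, active)
    return best
```

With the default (pos, delta) sort of the half-open version above, the same input [(1, 5), (5, 9)] reports a maximum overlap of 1, because the -1 at position 5 is applied before the +1.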

Active-set count via heap or counter

Sometimes you need which intervals are currently active, not just the count. Keep a heap keyed by end positions for sorted-by-end-time access (O(log n) per event), or a set() of active ids for membership-only access (O(1) per event).

  • Heap of (end, id) — a min-heap surfaces the earliest-ending currently-active interval in O(log n), the natural eviction order.
  • set() of active ids — O(1) add/remove; no order.
  • Worked example: intervals [(1, 5), (3, 7), (6, 9)], query "active set at position 4" → {(1,5), (3,7)}.
def active_at(ivs: list[tuple[int, int]], q: int) -> set[tuple[int, int]]:
    return {(s, e) for s, e in ivs if s <= q < e}

(Naive O(n) is fine for one query; for many queries, sort + binary-search is the §4 upgrade.)
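A sketch of the heap variant mentioned above (max_overlap_heap is an illustrative name): sort by start, keep a min-heap of end positions, and evict everything that has already ended before admitting each interval—the heap size at each admission is the active count.

```python
import heapq

def max_overlap_heap(ivs: list[tuple[int, int]]) -> int:
    # Sort by start; min-heap holds end positions of active intervals.
    # Before admitting each interval, evict everything that ended at or
    # before its start (half-open [s, e) semantics).
    ends: list[int] = []
    best = 0
    for s, e in sorted(ivs):
        while ends and ends[0] <= s:
            heapq.heappop(ends)
        heapq.heappush(ends, e)
        best = max(best, len(ends))
    return best
```

The heap also answers "which interval ends next?"—the piece of state the plain event-sweep counter throws away.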

Common beginner mistakes

  • Using closed intervals [start, end] and getting boundary off-by-one bugs on adjacent intervals.
  • Forgetting tie-break order at coincident events—the answer drifts by 1 at every boundary.
  • Sorting events by position only and ignoring the kind—non-deterministic on ties.
  • Storing full intervals in a heap instead of just (end, id)—wastes memory and slows comparisons.
  • Reaching for O(n²) nested loops when sort-then-sweep is O(n log n) (the difference is 1000× at n=10000).

Drill interval and sweep-line problems →

Sweep-line diagram showing three horizontal interval bars at top, sorted event markers below with plus-one and minus-one labels, and a scanline cursor with an active-count counter.

Python interview question on intervals and sweep-line

Given a list of half-open intervals [(start, end), ...], return the maximum number of intervals active at any single position and the position where the maximum is first achieved.

Solution using a sorted-event sweep-line

def max_overlap_with_position(ivs: list[tuple[int, int]]) -> tuple[int, int | None]:
    events: list[tuple[int, int]] = []
    for s, e in ivs:
        events.append((s, 1))
        events.append((e, -1))
    events.sort()
    best = active = 0
    best_pos: int | None = None
    for pos, delta in events:
        active += delta
        if active > best:
            best, best_pos = active, pos
    return best, best_pos

Why this works: Decomposing each interval into a +1 start event and -1 end event lets the sweep handle arbitrary interval overlaps in a single linear pass after the sort. The running active counter equals the number of currently-overlapping intervals; tracking the max across the scan gives the answer in O(n log n) total (dominated by the sort). The position-of-max is captured at the moment the counter strictly exceeds the previous best, so we get the first position achieving the maximum, matching the prompt's tie-break.



3. Hash Tables for Counting and Graph State in Python

Hash tables for counting and graph state in Python for data engineering

Hash tables are the workhorse for counting problems, frequency questions, and graph representations. Whenever the prompt says "count something per X" or "for each X, what's the most common Y", you're in hash-table territory. The shape choices—Counter, defaultdict(int), defaultdict(list), defaultdict(set), nested defaultdict(lambda: defaultdict(int))—all share O(1) average insert and lookup; pick by what data lives at the leaf.

Pro tip: Use defaultdict(lambda: defaultdict(int)) for two-axis counters—not defaultdict(defaultdict(int)) (no lambda), which raises TypeError: first argument must be callable or None, because a defaultdict instance is a container, not a factory. That subtle bug is a classic interview trap.

defaultdict(int) for counting graph edges

defaultdict(int) is the canonical counter: missing keys auto-zero on read, so counts[k] += 1 always works without a guard. For multigraph edge counts, key by (source, dest) tuples; for "edges out of u", use a nested defaultdict(lambda: defaultdict(int)) so counts[u][v] returns the multiplicity of edge u→v.

  • defaultdict(int) — flat counter; faster than dict + get(k, 0) + 1 (one access vs two).
  • Counter — locked to int leaves; exposes .most_common() and arithmetic operators.
  • Nested defaultdict(lambda: defaultdict(int)) — two-axis counter; counts[u][v] for graph multiplicities.
  • Worked example: edges [(A, B), (A, B), (A, C), (B, A)] → {A: {B: 2, C: 1}, B: {A: 1}}.
from collections import defaultdict

def edge_counts(edges: list[tuple[str, str]]) -> dict[str, dict[str, int]]:
    counts: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        counts[u][v] += 1
    return {u: dict(c) for u, c in counts.items()}

Adjacency list with defaultdict(list)

For unweighted graphs with per-node neighbor enumeration, the adjacency list is the right shape. graph[u].append(v) adds an edge; graph[u] returns the neighbor list. For undirected graphs, append both directions.

  • defaultdict(list) — removes the if u not in graph: graph[u] = [] boilerplate.
  • Freeze with dict(graph) — once built, plain dict is slightly faster on access.
  • Worked example: edges [(A, B), (A, C), (B, A), (B, C)] → {A: [B, C], B: [A, C]}.
from collections import defaultdict

def build_graph(edges: list[tuple[str, str]]) -> dict[str, list[str]]:
    graph: dict[str, list[str]] = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)
    return dict(graph)

Per-source command counter

Databricks's Unix-command-call-counter problem (#115) is a "per-source per-target counter": for each source process, count calls to each command. Two defaultdict levels with an int leaf; O(n) over records, O(unique pairs) memory.

  • Outer key — source (e.g. pid).
  • Inner key — target (e.g. command name).
  • Leaf — integer count.
  • Worked example: records [(100, 'ls'), (100, 'cat'), (100, 'ls'), (200, 'ls'), (200, 'grep')] → {100: {ls: 2, cat: 1}, 200: {ls: 1, grep: 1}}.
from collections import defaultdict

def per_source_command_counts(records: list[tuple[int, str]]) -> dict[int, dict[str, int]]:
    counts: dict[int, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for pid, cmd in records:
        counts[pid][cmd] += 1
    return {pid: dict(c) for pid, c in counts.items()}

Common beginner mistakes

  • Using Counter when nested counts are needed—Counter[(u, v)] += 1 works but counts[u][v] is cleaner for two-axis queries.
  • Writing defaultdict(defaultdict(int)) (no lambda) instead of defaultdict(lambda: defaultdict(int))—the default factory must be callable, so passing a defaultdict instance raises TypeError at construction.
  • Mutating the dict while iterating .items()—raises RuntimeError: dictionary changed size during iteration.
  • Reading a missing key on a defaultdict and then checking if k not in d—the read inserted k, so the check is always False.
  • Building an adjacency list with dict plus if u not in graph: graph[u] = []—correct but verbose; defaultdict(list) is one line.

Python interview question on hash tables and graph state

You're given a list of (caller_pid, command_name) tuples. Build a structure that, for any caller, returns its top-K most-called commands with deterministic tie-breaks (alphabetical).

Solution using nested defaultdict and (-count, name) sort key

from collections import defaultdict

def top_commands_per_caller(
    records: list[tuple[int, str]], k: int
) -> dict[int, list[str]]:
    counts: dict[int, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for pid, cmd in records:
        counts[pid][cmd] += 1
    out: dict[int, list[str]] = {}
    for pid, cmd_counts in counts.items():
        ordered = sorted(cmd_counts.items(), key=lambda x: (-x[1], x[0]))
        out[pid] = [c for c, _ in ordered[:k]]
    return out

Why this works: The outer defaultdict keyed by pid lets each process get its own private command counter without explicit initialization. A single pass over the records is O(n). For each pid, sorting its commands by (-count, name) resolves ties alphabetically—the standard word-frequency tie-break idiom. Slicing [:k] bounds output to k items per pid; total cost is O(n + S * U log U) where S is the number of sources and U is the unique commands per source. In typical usage U is small, so the sort term is dominated by the linear scan.



4. Binary Search on Intervals and Sorted Data in Python

Binary search on intervals and sorted data in Python for data engineering

Binary search is the textbook O(log n) lookup primitive for sorted arrays; Python's bisect module gives boundary locators (bisect_left, bisect_right) and insertion variants (insort_left, insort_right). The pattern generalizes three ways: lookup in a sorted array, binary search on intervals (sort by start, bisect to the rightmost candidate, walk left collecting intervals whose end > q), and binary search on the answer (parametric search over a monotonic predicate). Databricks's lamps-illuminating-control-points problem (#116) uses pattern #2.

Pro tip: State the desired semantics out loud before reaching for bisect_left vs bisect_right. "First index where arr[i] >= x" is bisect_left; "first index where arr[i] > x" is bisect_right. The off-by-one between them is the most common interview bug.

bisect_left vs bisect_right

Both take a sorted list and a query x, returning an integer i such that inserting x at position i keeps the list sorted. They differ only on duplicates: left points at the first occurrence, right points one past the last.

  • bisect_left(arr, x) — leftmost insertion point; first index where arr[i] >= x.
  • bisect_right(arr, x) — rightmost insertion point; first index where arr[i] > x.
  • Count of x — bisect_right(arr, x) - bisect_left(arr, x).
  • Worked example: arr = [1, 3, 3, 5, 7], x = 3 → bisect_left = 1, bisect_right = 3, count = 2.
from bisect import bisect_left, bisect_right

def count_occurrences(arr: list[int], x: int) -> int:
    return bisect_right(arr, x) - bisect_left(arr, x)

Binary search for "first interval containing x"

Sort intervals by start; for query q, locate the rightmost interval whose start <= q via bisect_right(starts, q) - 1, then walk leftward over every interval starting at or before q, keeping those whose end > q. O(log n + m) per query, where m is the number of intervals starting at or before q—output-sensitive whenever most of those candidates actually cover q.

  • bisect_right then walk left — only intervals with start <= q can cover q.
  • Half-open membership: s <= q < e.
  • Worked example: intervals [(1, 5), (3, 7), (4, 6), (8, 10)] sorted by start, query q = 5. bisect_right([1,3,4,8], 5) = 3 → candidate idx 2 = (4, 6). Check 6 > 5 ✓; walk left: (3, 7) ✓; (1, 5): 5 > 5 ✗. Result (in walk order): [(4, 6), (3, 7)].
from bisect import bisect_right

def intervals_containing(ivs: list[tuple[int, int]], q: int) -> list[tuple[int, int]]:
    ivs = sorted(ivs)
    starts = [s for s, _ in ivs]
    idx = bisect_right(starts, q) - 1
    out = []
    while idx >= 0 and ivs[idx][0] <= q:
        if ivs[idx][1] > q:
            out.append(ivs[idx])
        idx -= 1
    return out

Binary search on the answer (parametric)

When the answer is a number and feasible(X) can be checked in O(n), binary-search the value space in O(n log(max - min)). The decision predicate must be monotonic in X—the property that makes binary search applicable.

  • Template: while lo < hi: mid = (lo + hi) // 2; if feasible(mid): hi = mid; else: lo = mid + 1.
  • Bounds matter — lo = smallest plausible answer, hi = guaranteed-feasible upper bound; off-by-one here gives wrong answers.
  • Examples: "min ship capacity in K days," "smallest k to split array with sum bound," "min bandwidth to meet deadline."
  • Worked example: ship [1..10] in 5 days; lo = max = 10, hi = sum = 55 → answer 15.
def min_capacity_to_ship(weights: list[int], days: int) -> int:
    lo, hi = max(weights), sum(weights)

    def feasible(cap: int) -> bool:
        d = 1
        load = 0
        for w in weights:
            if load + w > cap:
                d += 1
                load = 0
            load += w
        return d <= days

    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

Common beginner mistakes

  • Confusing bisect_left and bisect_right and getting boundary off-by-one bugs.
  • Reaching for a linear scan when the array is already sorted—wasting the O(log n) opportunity.
  • Forgetting to sort before bisecting—bisect requires sorted input, and Python won't warn if it isn't.
  • In binary-search-on-the-answer, picking hi = max(arr) when hi = sum(arr) is required (or vice versa)—wrong bounds give wrong answers.
  • Writing while lo <= hi instead of while lo < hi in the hi = mid template—once lo == hi with feasible(mid) true, hi stays pinned at mid and the loop never terminates.

Python interview question on binary search and intervals

Given a sorted list of half-open lamp intervals [(start, end), ...] and a query point q, return the count of lamps illuminating q in O(log n + k) time per query, where k is the number of covering lamps.

Solution using bisect_right and a leftward walk

from bisect import bisect_right

def lamps_at(ivs: list[tuple[int, int]], q: int) -> int:
    ivs = sorted(ivs)
    starts = [s for s, _ in ivs]
    idx = bisect_right(starts, q) - 1
    count = 0
    while idx >= 0 and ivs[idx][0] <= q:
        if ivs[idx][1] > q:
            count += 1
        idx -= 1
    return count

Why this works: bisect_right(starts, q) finds the first index where the start exceeds q in O(log n); subtracting 1 gives the rightmost interval whose start is <= q—the latest possible candidate to cover q. The leftward walk visits every interval whose start is <= q; among those, the ones with end > q cover q. Because the list is sorted by start and the walk moves leftward, the start <= q condition never fails mid-walk—the loop ends when it runs off the front. Total cost: O(log n) for the bisect plus O(m) for the walk, where m is the number of intervals starting at or before q. That matches the O(log n + k) target when most early-starting lamps still cover q; for adversarial inputs with many short early intervals, an interval tree restores the strict output-sensitive bound.



5. Bit Manipulation and CIDR / Firewall Design in Python

Bit manipulation and CIDR design in Python for data engineering

CIDR notation (192.168.0.0/24) compresses IP ranges into a base address and prefix length. Matching an IP against a CIDR rule is a bit-mask op: (ip & mask) == (network & mask), where mask = (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF. The five primitives to memorize: & (mask), | (set), ^ (toggle), ~ (complement), << / >> (shift). For production scale, the canonical structure is a CIDR trie (radix tree on IP prefixes) with O(32) = O(1) per lookup regardless of rule count.

Pro tip: Always clamp shifts with & 0xFFFFFFFF. Python integers are arbitrary precision; without the clamp, 0xFFFFFFFF << k leaves 1s in bits above 31 and breaks the comparison.

IPv4 as a 32-bit integer

An IPv4 address is four 8-bit numbers but underneath is just a 32-bit unsigned int (big-endian: leftmost octet → highest bits). Integer representation lets bit operations replace string parsing on the hot path—dozens of times faster.

  • Encoding: (a << 24) | (b << 16) | (c << 8) | d.
  • 192.168.1.5 → 0xC0A80105 = 3232235781.
  • stdlib alternative: int(ipaddress.IPv4Address(s)), but the manual form works without imports.
def ip_to_int(ip: str) -> int:
    a, b, c, d = (int(x) for x in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

CIDR mask: prefix / bit-length

A /24 mask is 0xFFFFFF00: 24 ones followed by 8 zeros. Compute via mask = (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF. Match by AND-then-compare. Three operations, tens of nanoseconds in Python.

  • Mask formula: (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF.
  • Match test: (ip & mask) == (network & mask).
  • /0 matches everything — early-out avoids the shift-by-32 quirk.
  • Worked example: Is 192.168.1.5 in 192.168.0.0/16? 0xC0A80105 & 0xFFFF0000 = 0xC0A80000 = 0xC0A80000 & 0xFFFF0000 ✓.
def cidr_match(ip: int, cidr: tuple[int, int]) -> bool:
    network, prefix = cidr
    if prefix == 0:
        return True  # /0 matches everything
    mask = (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF
    return (ip & mask) == (network & mask)

Trie-based firewall match (longest prefix)

For thousands of rules at line rate, linear-scanning per packet is too slow. The production structure is a binary trie: each edge a bit, each node a partial prefix. Insert walks prefix_length bits and marks the rule; query walks 32 bits, returning the deepest marked node (longest-prefix match). O(32) = O(1) per query.

  • Trie node — left edge = 0 bit, right edge = 1 bit; mark with rule at insertion depth.
  • Longest-prefix match is built in: walk 32 bits, deepest mark wins.
  • For interview pace: linear-scan with CIDR mask is sufficient; mention the trie as the production-scale extension.
  • Worked example: rules [10.0.0.0/8, 10.1.0.0/16, 10.1.2.0/24], query 10.1.2.5 → all three match; longest = /24.
def longest_prefix_match(ip: int, rules: list[tuple[int, int]]) -> tuple[int, int] | None:
    best: tuple[int, int] | None = None
    best_prefix = -1
    for network, prefix in rules:
        if cidr_match(ip, (network, prefix)) and prefix > best_prefix:
            best, best_prefix = (network, prefix), prefix
    return best

Common beginner mistakes

  • Computing the mask without the & 0xFFFFFFFF clamp—Python's arbitrary-precision integers leave 1s in bits above 31, which then breaks the comparison.
  • Comparing ip & mask to network directly without applying mask to network—if the network address has nonzero host bits, the comparison fails even on a true match.
  • Special-casing /32 (host route) when the general code handles it correctly.
  • Forgetting that /0 matches everything—a default route. The if prefix == 0: return True early-out avoids a shift-by-32 quirk.
  • Reaching for ipaddress.IPv4Address(ip) in ipaddress.IPv4Network(cidr) when the manual int-and-mask version is several times faster on the hot path.

Practice bit manipulation interview problems →

Python interview question on bit manipulation and CIDR

Given a list of CIDR rules [(network_int, prefix_length), ...] and an IP address as a string, return the rule with the longest matching prefix (most specific match), or None if no rule matches.

Solution using bit-mask comparison and a max-prefix tracker

def ip_to_int(ip: str) -> int:
    a, b, c, d = (int(x) for x in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def cidr_match(ip: int, network: int, prefix: int) -> bool:
    if prefix == 0:
        return True
    mask = (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF
    return (ip & mask) == (network & mask)

def longest_match(ip_str: str, rules: list[tuple[int, int]]) -> tuple[int, int] | None:
    ip = ip_to_int(ip_str)
    best: tuple[int, int] | None = None
    best_prefix = -1
    for network, prefix in rules:
        if cidr_match(ip, network, prefix) and prefix > best_prefix:
            best, best_prefix = (network, prefix), prefix
    return best

Why this works: ip_to_int converts the dotted-quad string into a 32-bit integer once, then all matching is integer arithmetic—dozens of times faster than per-rule string parsing. cidr_match applies the prefix mask via two AND operations: any bits below the prefix length are zeroed in both the query and the rule's network, so the comparison ignores host bits. The prefix > best_prefix check ensures we keep the longest match across all rules—the standard longest-prefix-match semantics. Total cost is O(R) for R rules per query; for production scale, replace with a trie for O(1) per query.



6. Sparse Matrix Representation with Hash Tables in Python

Sparse matrices and sparse-matrix-vector multiplication in Python for data engineering

Real DE matrices are almost always sparse—99.9%+ zeros (recommender user×item, document×term, graph adjacency). Storing as hash tables keyed by (row, col) saves memory proportional to nnz (non-zeros) and enables O(nnz) matrix-vector multiplication. Three representations compete: flat dict[(i, j)] = v, nested defaultdict(dict) with m[i][j] = v, and Compressed Sparse Row (CSR) used by scipy / numpy.

Pro tip: Below 1% density, sparse wins on memory and speed. Above 50% density, dense wins (cache locality, vectorization). State the density threshold; interviewers grade it.

Dense vs sparse matrix storage

A dense 2D matrix is O(rows * cols) regardless of content; sparse is O(nnz). For a 10⁶ × 10⁶ matrix with 1000 non-zeros: dense ~8 TB, sparse ~50 KB—eight orders of magnitude.

  • Dense list[list] — O(1) (i, j) access; wastes memory at low density.
  • Flat dict[(i, j)] = v — O(nnz) memory, O(1) random access, O(nnz) row scan.
  • Nested defaultdict(dict) — O(nnz) memory, O(1) row lookup + O(row_nnz) row scan.
  • CSR — production representation; what scipy uses internally.
from collections import defaultdict

def to_sparse(triples: list[tuple[int, int, float]]) -> dict[int, dict[int, float]]:
    m: dict[int, dict[int, float]] = defaultdict(dict)
    for i, j, v in triples:
        if v != 0.0:
            m[i][j] = v
    return dict(m)

dict[(i, j)] = v vs defaultdict(dict)

For sparse matrix-vector multiplication, the nested form wins: the natural iteration is "for each row, for each non-zero in that row." Flat form is fine for arbitrary (i, j) access but turns "give me row i" into an O(nnz) full-scan.

  • Flat tuple key — single dict, single lookup; row queries are O(nnz).
  • Nested defaultdict(dict) — two lookups; row queries are O(row_nnz).
  • Worked example: non-zeros (0,1,7), (2,3,4), (4,0,9):
| flat | nested |
| --- | --- |
| {(0,1): 7, (2,3): 4, (4,0): 9} | {0: {1: 7}, 2: {3: 4}, 4: {0: 9}} |
def row_entries_flat(m: dict[tuple[int, int], float], i: int) -> list[tuple[int, float]]:
    return [(j, v) for (ii, j), v in m.items() if ii == i]

def row_entries_nested(m: dict[int, dict[int, float]], i: int) -> list[tuple[int, float]]:
    return list(m.get(i, {}).items())

Sparse matrix-vector multiplication in O(nnz)

The dense algorithm is O(rows * cols); the sparse rewrite iterates only over non-zeros: for each m[i][j] = v, accumulate output[i] += v * vector[j]. Total O(nnz). Embarrassingly parallel at the row level—Spark / dask / scipy parallelize it across partitions.

  • Loop: for each (i, row) in m; for each (j, v) in row; out[i] += v * vector[j].
  • Invariant: after the sweep, out[i] is the dot product of row i with vector.
  • Worked example: matrix (0,1,7), (1,0,2), (2,2,5), vector [10, 20, 30]:
| row | non-zero | accumulation |
| --- | --- | --- |
| 0 | (1, 7) | 7 * 20 = 140 |
| 1 | (0, 2) | 2 * 10 = 20 |
| 2 | (2, 5) | 5 * 30 = 150 |

Output: [140, 20, 150].

def sparse_matvec(
    m: dict[int, dict[int, float]], v: list[float], rows: int
) -> list[float]:
    out = [0.0] * rows
    for i, row in m.items():
        for j, val in row.items():
            out[i] += val * v[j]
    return out

Common beginner mistakes

  • Storing zeros explicitly in the sparse representation—defeats the memory savings.
  • Iterating over a dense range(rows) × range(cols) and only multiplying non-zeros—still O(rows * cols) because of the iteration cost; the win is iterating only over non-zeros.
  • Using a flat dict[(i, j)] representation when row-by-row access is needed—per-row queries become O(nnz).
  • Forgetting that defaultdict(dict)'s m[i] creates an empty inner dict on read—pollutes the structure. Use m.get(i, {}) for read-only access.
  • Reaching for scipy-style CSR in an interview when nested defaultdict(dict) is sufficient and easier to explain.
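The read-pollution pitfall from the list above is worth seeing once. A minimal demo:

```python
# Demonstrates the defaultdict read-pollution pitfall: indexing with []
# inserts an empty inner dict; .get() does not.
from collections import defaultdict

m: dict[int, dict[int, float]] = defaultdict(dict)
m[0][1] = 7.0

_ = m[5]            # [] on a missing row silently creates row 5
_ = m.get(6, {})    # .get() leaves the structure untouched

print(sorted(m))    # [0, 5] -- row 5 was created by the read
```

This is why to_sparse above converts back to a plain dict before returning: downstream readers can no longer pollute the structure by accident.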

Python interview question on sparse matrix-vector multiplication

Given a sparse matrix as a list of (row, column, value) triples, the total number of rows, and a dense vector of the appropriate length, return the dense output vector of the matrix-vector product in O(nnz + rows) time.

Solution using defaultdict(dict) and a single sweep over non-zeros

from collections import defaultdict

def sparse_matvec(
    triples: list[tuple[int, int, float]], rows: int, vector: list[float]
) -> list[float]:
    m: dict[int, dict[int, float]] = defaultdict(dict)
    for i, j, v in triples:
        m[i][j] = v
    out = [0.0] * rows
    for i, row in m.items():
        for j, val in row.items():
            out[i] += val * vector[j]
    return out

Why this works: Building the nested representation is O(nnz). The matrix-vector product iterates exactly nnz times—each non-zero contributes one multiply-add to the output. The output vector is rows long and initialized in O(rows). Total: O(nnz + rows). For sparse data with nnz << rows * cols, this is dramatically faster than the O(rows * cols) dense algorithm—often 10×–100× on real recommender / document-term matrices.
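For completeness, CSR—mentioned above as the production representation—can be sketched in pure Python. The three-array layout (data, column indices, row pointers) mirrors what scipy.sparse.csr_matrix stores internally, but the to_csr / csr_matvec names here are illustrative, not scipy's API.

```python
# Illustrative CSR sketch: three parallel arrays (data, indices, indptr),
# the layout scipy.sparse.csr_matrix uses internally. Names are ours.
def to_csr(
    triples: list[tuple[int, int, float]], rows: int
) -> tuple[list[float], list[int], list[int]]:
    data: list[float] = []
    indices: list[int] = []
    indptr = [0] * (rows + 1)
    for i, j, v in sorted(triples):      # group non-zeros by row
        data.append(v)
        indices.append(j)
        indptr[i + 1] += 1
    for r in range(rows):                # prefix-sum row counts into offsets
        indptr[r + 1] += indptr[r]
    return data, indices, indptr

def csr_matvec(
    data: list[float], indices: list[int], indptr: list[int], vector: list[float]
) -> list[float]:
    out = [0.0] * (len(indptr) - 1)
    for i in range(len(out)):
        for k in range(indptr[i], indptr[i + 1]):  # non-zeros of row i
            out[i] += data[k] * vector[indices[k]]
    return out
```

CSR trades the dict overhead for cache-friendly contiguous arrays; in an interview, the nested-dict version is usually the better answer, but naming this layout shows you know where scipy's speed comes from.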



7. Dynamic Programming, Sliding Window, and Greedy in Python

Dynamic programming, sliding window, and greedy in Python for data engineering

Hard-tier optimization problems are rarely a single named technique—they're hybrids. The mental decision tree: DP when sub-problems compose top-down with memoization; greedy when local-best provably implies global-best (cheaper than DP, but you owe an exchange-argument proof); sliding window when the answer is a contiguous range that grows or shrinks under a predicate. Hybrids—DP-on-window, greedy-after-sort, DP-with-greedy-transitions—are the norm at the Hard tier.

Pro tip: State the family before writing code. "This is DP because no greedy-choice property holds for arbitrary denominations." "This is sliding window because the predicate is monotonic in window size." Interviewers grade the reasoning more than the typing.

When DP beats greedy and when greedy beats DP

Greedy works when the greedy-choice property holds: locally optimal choices compose into a globally optimal solution (activity selection, Huffman, Kruskal). DP wins when the property fails (knapsack, edit distance, LIS). The interview decision rule: if you can sketch an exchange argument, use greedy; else default to DP.

  • Greedy works — activity selection (sort by end), Huffman (merge two min-frequency), Kruskal (lowest-weight non-cycle edge).
  • DP wins — knapsack 0/1, edit distance, longest increasing subsequence.
  • Worked example — coin change: denominations [1, 5, 10, 25] → greedy works. Denominations [1, 3, 4], N = 6 → greedy gives 4 + 1 + 1 = 3 coins; DP gives 3 + 3 = 2. Greedy fails because the greedy-choice property doesn't hold for arbitrary denominations.
def min_coins_dp(coins: list[int], n: int) -> int:
    INF = float("inf")
    dp = [0] + [INF] * n
    for i in range(1, n + 1):
        for c in coins:
            if c <= i:
                dp[i] = min(dp[i], dp[i - c] + 1)
    return dp[n] if dp[n] != INF else -1

Sliding window with shrinking left pointer

Two pointers left and right define the window [left, right]. Right always advances; left advances as needed to maintain the invariant. Each pointer moves at most n times across the scan—total work O(n) amortized despite the inner while loop.

  • Right advance — add arr[right] to window state.
  • Left advance — while invariant violated, remove arr[left] and advance left.
  • Common invariants — "all distinct", "sum ≤ target", "at most k distinct".
  • Worked example — longest substring with all distinct characters in "abcabcbb":
| step | right | window | left advances | best |
| --- | --- | --- | --- | --- |
| 1 | 0 (a) | a | | 1 |
| 2 | 1 (b) | ab | | 2 |
| 3 | 2 (c) | abc | | 3 |
| 4 | 3 (a) | abca (dup) | left → 1 → bca | 3 |
| 5 | 4 (b) | bcab (dup) | left → 2 → cab | 3 |
| 6 | 5 (c) | cabc (dup) | left → 3 → abc | 3 |

Result: 3.

def longest_distinct(s: str) -> int:
    seen: dict[str, int] = {}
    left = best = 0
    for right, ch in enumerate(s):
        if ch in seen and seen[ch] >= left:
            left = seen[ch] + 1
        seen[ch] = right
        best = max(best, right - left + 1)
    return best

Combining DP + sliding window for prefix-removal

Databricks's hardest problem (#322) combines DP and sliding window. Define dp[i] = min operations to clear suffix s[i:]. Transition: from i, find the rightmost j such that s[i:j] satisfies a problem-specific constraint (via sliding window), then dp[i] = 1 + dp[j]. Iterate right-to-left; answer is dp[0].

  • dp[i] = min ops on suffix s[i:], filled right-to-left.
  • Sliding window per state — extend j while the constraint holds; stop at the first violation.
  • Complexity — O(n²) worst case; in practice the inner window stops as soon as the constraint breaks.
  • Worked example — at-most-one 1 per removed prefix on "10110":
| op | remaining | removed |
| --- | --- | --- |
| 1 | "10110" → "110" | "10" (one 1) |
| 2 | "110" → "10" | "1" (one 1) |
| 3 | "10" → "" | "10" (one 1) |

Total: 3 operations.

def min_prefix_removals(s: str) -> int:
    """Min ops to empty s when each removed prefix may contain at most one '1'."""
    n = len(s)
    dp = [0] * (n + 1)          # dp[i] = min ops to clear suffix s[i:]
    for i in range(n - 1, -1, -1):
        ones = 0
        j = i
        # slide j right while the window s[i:j] holds at most one '1'
        while j < n and ones + (s[j] == "1") <= 1:
            ones += s[j] == "1"
            j += 1
        dp[i] = 1 + dp[j]       # remove s[i:j], then clear the rest
    return dp[0]

Common beginner mistakes

  • Reaching for greedy when DP is required (no greedy-choice property)—silently wrong on adversarial inputs.
  • Forgetting to memoize a recursive DP—exponential time on a polynomial-time problem.
  • Using two nested loops where a sliding window with two pointers suffices—O(n²) instead of O(n).
  • Off-by-one on the right-pointer advancement in sliding window—either skip an element or process the same element twice.
  • Not stating the invariant of the sliding window before writing code—high error rate on the boundary logic.

Python interview question on DP + sliding window

Given a string s containing only lowercase letters, you can remove any prefix in one operation. Each removed prefix must contain at most k distinct letters. Return the minimum number of operations to empty the string.

Solution using DP filled right-to-left with a sliding-window per state

from collections import Counter

def min_ops(s: str, k: int) -> int:
    n = len(s)
    dp = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        cnt: Counter = Counter()
        j = i
        while j < n:
            cnt[s[j]] += 1
            if len(cnt) > k:
                cnt[s[j]] -= 1
                if cnt[s[j]] == 0:
                    del cnt[s[j]]
                break
            j += 1
        dp[i] = 1 + dp[j]
    return dp[0]

Why this works: dp[i] stores the minimum operations to clear suffix s[i:]. The right-to-left fill order ensures dp[j] is computed before we need it for state i. For each i, the inner while loop finds the longest prefix starting at i that satisfies the at-most-k-distinct constraint—a forward sliding window with a Counter tracking distinct letters. As soon as the constraint breaks, j is the boundary; we charge one operation for removing s[i:j] and recurse to dp[j]. Total complexity is O(n²) worst case (each i may scan to the end); in practice the inner window stops quickly once k + 1 distinct letters appear.



8. Binary Tree Traversal in Constant Space (Morris) in Python

Morris constant-space binary-tree traversal in Python for data engineering

The standard binary-tree in-order traversal uses O(h) extra space (recursion stack or explicit stack). Morris traversal achieves O(1) extra space by threading the tree—temporarily pointing each subtree's rightmost-leaf to its in-order successor, then unthreading on the way back. The trick (J. H. Morris, 1979) remains the canonical answer to "traverse without recursion or a stack." Production tree traversals on huge trees (parse trees, decision trees, hierarchies) can easily exceed call-stack limits; Morris bounds memory at one or two pointers regardless of tree size.

Pro tip: Morris mutates and restores—the input tree is structurally unchanged on exit. State this contract; in concurrent systems a reader mid-traversal sees threaded edges, which is unsafe without locking.

Recursive vs iterative-with-stack traversal

Three pre-Morris approaches: recursion (O(h) call stack; risks overflow), explicit stack (O(h) heap; no overflow), and parent pointers (O(1) if the tree node carries parent). Morris achieves O(1) on a vanilla (left, right) node by manufacturing back-pointers temporarily.

  • Recursive — clean code; stack-overflow risk on degenerate trees.
  • Explicit stack — same O(h) space, no recursion.
  • Parent pointers — O(1) space but requires the back-pointer in the tree representation.
  • Worked example tree:
        5
       / \
      3   7
     / \
    2   4

In-order: 2, 3, 4, 5, 7. Recursive call stack to visit 2: 3 frames (5 → 3 → 2). Iterative-with-stack: 3 slots.

def inorder_recursive(node) -> list[int]:
    out: list[int] = []
    def go(n):
        if not n:
            return
        go(n.left)
        out.append(n.val)
        go(n.right)
    go(node)
    return out
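The explicit-stack variant from the bullet list is worth having on hand too—same O(h) space as recursion but on the heap, so no recursion-depth limit. The small Node class here is an assumed minimal definition for a self-contained sketch.

```python
# Explicit-stack in-order traversal: O(h) heap space, no recursion.
# Node is a minimal assumed definition; any (val, left, right) node works.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    val: int
    left: "Optional[Node]" = None
    right: "Optional[Node]" = None

def inorder_iterative(root: Optional[Node]) -> list[int]:
    out: list[int] = []
    stack: list[Node] = []
    cur = root
    while cur or stack:
        while cur:                  # push the entire left spine
            stack.append(cur)
            cur = cur.left
        cur = stack.pop()
        out.append(cur.val)         # visit after the left subtree
        cur = cur.right
    return out
```

On the worked-example tree this yields [2, 3, 4, 5, 7], and the stack never holds more than the tree's height.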

Morris: thread the in-order successor

Walk with a single pointer cur. State machine: if cur.left is None, visit and move right. Else find the rightmost descendant of the left subtree (the in-order predecessor); if its right is None, thread it to cur and recurse left; if its right is cur, unthread, visit cur, and move right.

  • Thread — predecessor.right = cur on first visit; replaces the implicit stack frame.
  • Unthread — predecessor.right = None on second visit; restores the tree.
  • Total work O(n); extra space O(1) for cur and pred.
  • Worked example on the 5-node tree above:
| step | cur | action | output |
| --- | --- | --- | --- |
| 1 | 5 | left exists; pred = 4; thread 4.right = 5; cur → 3 | |
| 2 | 3 | left exists; pred = 2; thread 2.right = 3; cur → 2 | |
| 3 | 2 | no left; visit 2; cur → 3 (via thread) | [2] |
| 4 | 3 | left exists; pred.right == 3; unthread; visit 3; cur → 4 | [2, 3] |
| 5 | 4 | no left; visit 4; cur → 5 (via thread) | [2, 3, 4] |
| 6 | 5 | left exists; pred.right == 5; unthread; visit 5; cur → 7 | [2, 3, 4, 5] |
| 7 | 7 | no left; visit 7; cur → None | [2, 3, 4, 5, 7] |

def morris_inorder(root) -> list[int]:
    out: list[int] = []
    cur = root
    while cur:
        if cur.left is None:
            out.append(cur.val)
            cur = cur.right
        else:
            # find in-order predecessor (rightmost in left subtree)
            pred = cur.left
            while pred.right and pred.right is not cur:
                pred = pred.right
            if pred.right is None:
                # thread it
                pred.right = cur
                cur = cur.left
            else:
                # unthread, visit cur, move right
                pred.right = None
                out.append(cur.val)
                cur = cur.right
    return out

Restoring the tree after Morris pass

Threading on first visit, unthreading on second. Every thread is unthreaded before exit, so the tree on exit is structurally identical to the input. Single-threaded only: a concurrent reader mid-traversal can follow a threaded edge to the wrong node.

  • Single mutation pattern — pred.right = cur (thread), pred.right = None (unthread).
  • Pre-/post-order Morris exist but are noticeably trickier; default to in-order in interview pace.
  • Concurrency: unsafe without locks; mention this if the interviewer probes.
  • Worked example: after the 7-step pass, the tree is byte-for-byte identical to the input—verify by walking each node's right pointer.
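The restoration contract can be checked mechanically: snapshot every node's child identities before the pass, traverse, and compare. This sketch repeats a minimal Node and morris_inorder so it runs standalone; the snapshot helper is an illustrative addition, not from the original.

```python
# Self-contained check that a Morris pass restores the tree structure.
class Node:
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right

def morris_inorder(root):
    out, cur = [], root
    while cur:
        if cur.left is None:
            out.append(cur.val)
            cur = cur.right
        else:
            pred = cur.left
            while pred.right and pred.right is not cur:
                pred = pred.right
            if pred.right is None:
                pred.right = cur      # thread
                cur = cur.left
            else:
                pred.right = None     # unthread
                out.append(cur.val)
                cur = cur.right
    return out

def snapshot(root):
    # map each node's identity to the identities of its children
    seen, stack = {}, [root]
    while stack:
        n = stack.pop()
        if n is not None:
            seen[id(n)] = (id(n.left), id(n.right))
            stack.extend((n.left, n.right))
    return seen

root = Node(5, Node(3, Node(2), Node(4)), Node(7))
before = snapshot(root)
assert morris_inorder(root) == [2, 3, 4, 5, 7]
assert snapshot(root) == before   # structure unchanged after the pass
```

Stating this check out loud—"I can verify restoration by comparing child pointers before and after"—is an easy way to demonstrate the mutate-and-restore contract in an interview.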

Common beginner mistakes

  • Forgetting to unthread—the tree is left mutated on exit, causing infinite loops in subsequent traversals.
  • Using == (value equality) instead of is (identity) when checking pred.right is not cur—works for unique values, breaks for duplicate values.
  • Stopping the predecessor walk at the first None instead of "first None or already-equal-to-cur"—infinite loops on threaded subtrees.
  • Implementing pre-order or post-order Morris from scratch—those exist but are noticeably trickier than in-order; default to in-order in interview pace and mention the variants.
  • Using Morris on a tree shared with another thread—the temporary mutations make it unsafe without locking.

See more tree traversal problems →

Binary tree diagram with threaded successor arrows showing a Morris in-order traversal mid-walk.

Python interview question on Morris traversal

Implement an in-order traversal of a binary tree using O(1) extra space. The tree node has val, left, and right attributes. The tree's structure must be unchanged on exit.

Solution using Morris threading

def morris_inorder(root) -> list[int]:
    out: list[int] = []
    cur = root
    while cur:
        if cur.left is None:
            out.append(cur.val)
            cur = cur.right
        else:
            pred = cur.left
            while pred.right and pred.right is not cur:
                pred = pred.right
            if pred.right is None:
                pred.right = cur
                cur = cur.left
            else:
                pred.right = None
                out.append(cur.val)
                cur = cur.right
    return out

Why this works: The algorithm replaces the implicit recursion stack with a temporary thread—each node's in-order predecessor points back to it via a right-pointer mutation. When the left-subtree traversal completes, the thread provides the path back; we then unthread (restoring the original tree), visit the node, and recurse right. Each node is visited at most twice (once to thread, once to visit + unthread), so total time is O(n). Extra space is O(1)—just cur and pred pointers, no stack. Because every thread is unthreaded before exit, the tree on exit is byte-for-byte identical to the input—Morris mutates and restores.



Tips to crack Databricks data engineering interviews

These are habits that move the needle in real Databricks loops—not a re-statement of the topics above.

Python algorithm preparation

Databricks's curated set is all Python, no SQL. Spend most of your prep on stdlib fluency (bisect, collections, itertools, functools), classic algorithm patterns (sweep-line, binary search, monotonic stack, two-pointer sliding window, DP, graph traversal), and bit manipulation for integer-encoded representations like CIDR. The intervals, binary search, and bit manipulation topic pages cover the bulk.

Hard-tier preparation

3 of 9 problems are Hard. Do not skip the Hard tier—candidates who avoid Morris traversal, DP+sliding-window, and sparse matrices will under-perform. Work through Morris in-order from memory at least five times before your interview. Be ready to derive the DP-on-window recurrence on a whiteboard. Recognize sparse representation as the "trade memory for speed via hash-table-of-hash-tables" trick.

Pipeline thinking

Databricks's framings are DE-flavored even when the algorithms are general: upload byte counts, IP firewall rules, sparse matrix-vector multiplications. The interviewer is testing whether you map the framing to the algorithm correctly. State the mapping out loud: "this is intervals + sweep-line"; "this is binary search on the answer"; "this is sparse matrix product." Mapping framings to families is the meta-skill.


Communication under time pressure

State assumptions before typing: "I'll assume IPv4, no IPv6"; "I'll assume the matrix fits the sparse representation, density < 1%"; "I'll assume the tree is mutable—Morris is safe." State invariants after key code blocks: "after this loop, every interval starting before q has been examined." State complexity: "this is O(n log n) for the sort, O(n) for the sweep, total O(n log n)." Interviewers grade clear reasoning above silent-and-perfect.


Frequently Asked Questions

What is the Databricks data engineering interview process like?

The Databricks data engineering interview typically includes a phone screen (Python warm-up around arrays or hash tables), one or two coding rounds focused on Python algorithms, a system-design conversation around pipelines and data architectures, and behavioral interviews. The curated 9-problem Databricks practice set on PipeCode mirrors what you will see on the technical rounds.

Does Databricks test SQL in their data engineering interviews?

The curated Databricks practice set is 100% Python—no SQL problems among the nine. Other Databricks interviewers may bring SQL in ad-hoc rounds, but the published company set is algorithm-heavy Python. Prepare for SQL separately if your role calls for it; the curated set will not drill it.

How important is Python for a Databricks data engineering interview?

Python is essentially the entire technical interview at Databricks—Python algorithms, stdlib fluency, and idiomatic patterns. Memorize: bisect_left/right, collections.defaultdict, collections.Counter, itertools.pairwise, functools.lru_cache, sweep-line, binary search, monotonic stack, sliding window, Morris traversal, sparse matrix representation. Stdlib fluency separates a clean answer from a 30-line manual loop.

How hard are Databricks data engineering interview questions?

Databricks's curated set has the highest Hard-question ratio of any FAANG-adjacent DE company hub—3 of 9 are Hard, including Morris traversal and DP+sliding-window. Candidates who skip the Hard tier will under-perform. Spend extra prep time on the patterns covered in §6, §7, §8 of this guide.

What algorithm topics should I prioritize for Databricks?

In rough order: (1) sweep-line over intervals, (2) hash tables for counters and graph state, (3) binary search variants (bisect, parametric), (4) bit manipulation for CIDR / packed structures, (5) sparse matrix representation, (6) DP-on-sliding-window, (7) Morris constant-space traversal. The first four show up at Easy / Medium; the last three at Hard. The intervals, binary search, and tree traversal topic pages cover the spread.

How many Databricks practice problems should I solve before the interview?

Aim for 30–50 problems spanning all eight topic clusters above—not 100 of the same kind. Solve every problem in the Databricks-tagged practice set, then back-fill weak areas using the topic pages linked throughout this guide. Pay extra attention to the Hard tier; that's where Databricks distinguishes between candidates.


Start practicing Databricks data engineering problems

Reading patterns is not the same as typing them under time pressure. PipeCode pairs company-tagged Databricks problems with tests, AI feedback, and a coding environment so you can drill the exact Python algorithms Databricks asks—without the noise of generic SQL prep that doesn't apply to this loop.

Pipecode.ai is Leetcode for Data Engineering.

Browse Databricks practice →
View all practice →

Top comments (0)