Gowtham Potureddi

Airbnb Data Engineering Interview Questions & Prep

Airbnb data engineering interview questions lean on SQL for marketplace-style analytics (aggregations, joins, window functions, and sessionization from event streams) and Python for data structures, intervals, and occasional graph or pandas-style coding. Deeper rounds often add verbal dimensional modeling and data warehouse design (star schema, SCDs, booking grain).

Everything below uses original teaching examples only—not proprietary interview items. The guide is structured for data engineering interview prep: worked examples in SQL and Python, a data modeling section, and topic-aligned practice via PipeCode’s Airbnb company hub. Narrate grain, join semantics, and NULL behavior as you go; interviewers care about the reasoning, not just the final answer.



Top Airbnb data engineering interview topics

| # | Topic | Why it shows up |
|---|-------|-----------------|
| 1 | Aggregation & GROUP BY / HAVING | Nightly price by market, booking counts, O–D (origin–destination) pairs, listing metrics. |
| 2 | Joins & anti-joins | Listings to bookings, page events to sessions, avoid fan-out. |
| 3 | Window functions | “First” booking, distinct visitors per scope, LAG / LEAD, ranking. |
| 4 | Gaps, sessions & time | Session ids from inactivity (e.g. 30+ minutes), half-open time ranges. |
| 5 | Subqueries, CASE, market-style reports | Filtered rollups, market pulse, conditional metrics. |
| 6 | Python: arrays, intervals, graphs, pandas | Max revenue from bookings, multi-dimensional iterators, attribute inheritance, merges. |
| 7 | Star schema, rental facts & SCDs | STR grain, slowly changing listing or host attributes. |

SQL evaluation order (mental model): FROM / joins → WHERE (row filter) → GROUP BY → aggregates → HAVING (group filter) → windows → SELECT / ORDER BY. If you are filtering a sum or count, that condition almost always belongs in HAVING, not WHERE.


1. Aggregation and GROUP BY in marketplace SQL

Aggregation and GROUP BY in SQL for data engineering

Picture a table with many detail rows—often one row per booking or per night line item. Aggregation means: turn lots of rows into one summary value (or a few) per bucket. The bucket is whatever the prompt names: per market, per listing, per origin–destination pair, or per (listing, week). GROUP BY names that bucket: every distinct combination of the grouping keys is one group; the engine runs SUM, COUNT, AVG inside each group separately.

Report grain (say this in the interview): After the query, ask, “One output row is what, in business terms?” If the question asked “gross booking dollars per market,” the grain is market. If you GROUP BY at listing by mistake, you are answering a different question.

SUM, AVG, and NULL

  • SUM(col) adds non-null numerics in the group. NULL is skipped (not coerced to 0). If every value in the group is NULL, SUM is usually NULL, not 0.
  • AVG(col) is sum of non-null values ÷ count of non-null values in that group.

Worked example: In one market_id group, amount_usd values 200, NULL, 300 → SUM(amount_usd) = 500, AVG(amount_usd) = 250 (only two values averaged).
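These NULL rules are easy to verify; a minimal sqlite3 sketch (in-memory table and values mirror the worked example, table name is illustrative):

```python
import sqlite3

# One market_id group with amounts 200, NULL, 300 (mirrors the worked example).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE night_stays (market_id TEXT, amount_usd REAL)")
conn.executemany("INSERT INTO night_stays VALUES (?, ?)",
                 [("M1", 200), ("M1", None), ("M1", 300)])

total, avg = conn.execute(
    "SELECT SUM(amount_usd), AVG(amount_usd) FROM night_stays WHERE market_id = 'M1'"
).fetchone()
print(total, avg)  # 500.0 250.0 (the NULL row is skipped, not coerced to 0)

# An all-NULL group: SUM returns NULL (None in Python), not 0.
conn.execute("INSERT INTO night_stays VALUES ('M2', NULL)")
(only_null,) = conn.execute(
    "SELECT SUM(amount_usd) FROM night_stays WHERE market_id = 'M2'"
).fetchone()
print(only_null)  # None
```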

COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)

  • COUNT(*) counts rows in the group, including rows where some column is NULL.
  • COUNT(col) counts rows where col is not NULL.
  • COUNT(DISTINCT col) counts unique non-null values—required after a join that multiplies a guest or listing across many rows, when the question is “how many guests” not “how many booking rows.”

Worked example: After a one-to-many join, two rows for the same guest_id → COUNT(*) = 2, COUNT(DISTINCT guest_id) = 1.
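A quick sqlite3 check of the three variants, on a toy post-join table with a duplicated guest_id and one NULL key (illustrative data):

```python
import sqlite3

# Rows as they might look after a one-to-many join: g1 appears twice, one NULL key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE joined (booking_id TEXT, guest_id TEXT)")
conn.executemany("INSERT INTO joined VALUES (?, ?)",
                 [("b1", "g1"), ("b2", "g1"), ("b3", None)])

n_rows, n_guest_rows, n_guests = conn.execute(
    "SELECT COUNT(*), COUNT(guest_id), COUNT(DISTINCT guest_id) FROM joined"
).fetchone()
# COUNT(*) counts every row; COUNT(guest_id) drops the NULL; DISTINCT collapses g1.
print(n_rows, n_guest_rows, n_guests)  # 3 2 1
```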

MIN / MAX

Use for prices, dates, ranks. NULL is skipped (same as in SUM semantics). In marketplace data, MAX(created_at) is often “latest booking in this group” before you switch to a window in §3.

WHERE vs HAVING

  • WHERE filters input rows before GROUP BY—predicates on raw columns of the table as seen before grouping.
  • HAVING filters after GROUP BY on aggregates—e.g. HAVING SUM(amount) > 1000 or HAVING COUNT(*) >= 2.

Worked example — per-market subtotals, two different filters:

| booking_id | market_id | subtotal | is_cancelled |
|---|---|---|---|
| b1 | M1 | 80 | false |
| b2 | M1 | 30 | true |
| b3 | M2 | 200 | false |

  • WHERE NOT is_cancelled then GROUP BY market_id, SUM(subtotal) → M1 = 80, M2 = 200 (canceled row dropped first).
  • Same table without WHERE, then HAVING SUM(subtotal) > 50 (sum of all subtotals in the group) → the per-market sums can differ from the first bullet; be explicit about which population the question means (all rows vs. completed only).

Rule of thumb: If the filter uses an aggregate of the group (SUM, COUNT, AVG), it belongs in HAVING, not WHERE. If it only uses one row at a time (e.g. status = 'confirmed'), use WHERE.

Half-open date ranges and revenue by period

Worked example: Nightly amount for calendar year 2024 per market_id:

```sql
SELECT market_id, SUM(amount_usd) AS rev
FROM night_stays
WHERE stay_date >= DATE '2024-01-01' AND stay_date < DATE '2025-01-01'
GROUP BY market_id;
```

Takeaway: The interval [2024-01-01, 2025-01-01) (half-open) avoids double counting a boundary if you also store timestamps at midnight. Say timezone if the table is TIMESTAMPTZ and the business rule is not UTC.

Conditional aggregation: SUM(CASE …) and net vs gross

Idea: Sum or count only the rows that meet a business rule inside the aggregate, instead of using a WHERE that would drop the other rows from the query entirely. Use it when one SELECT must return both “gross” and “net,” or when excluded rows still belong in the denominator of a rate.

Worked example: net booking dollars after cancellations (same table shape as before, now with the full pattern):

| booking_id | market_id | amount_usd | is_cancelled |
|---|---|---|---|
| b1 | M1 | 200 | false |
| b2 | M1 | 100 | true |

SUM(CASE WHEN NOT is_cancelled THEN amount_usd ELSE 0 END) = 200; gross would be SUM(amount_usd) = 300 if the prompt asked for both columns.

Common beginner mistakes

  • Putting a condition on SUM(x) in WHERE instead of HAVING.
  • COUNT(*) when the business asked for unique guests or listings—use COUNT(DISTINCT …) at the right grain.
  • Joining one-to-many before GROUP BY and summing a dimension field that repeats → double count; fix at grain or dedupe first (see §2).

SQL interview question on net booking volume by market

Table: bookings(booking_id, market_id, amount_usd, is_cancelled). Return each market_id and net booking volume (sum of amount_usd where NOT is_cancelled) for markets where that net sum is strictly positive, ordered by net volume descending.

Solution using GROUP BY and conditional SUM

```sql
SELECT market_id,
       SUM(CASE WHEN NOT is_cancelled THEN amount_usd ELSE 0 END) AS net_vol
FROM bookings
GROUP BY market_id
HAVING SUM(CASE WHEN NOT is_cancelled THEN amount_usd ELSE 0 END) > 0
ORDER BY net_vol DESC;
```

Why this works: Cancellations contribute 0 in the CASE; HAVING filters per-market aggregates, not per-row. Grain: one row per market_id.

SQL interview question on distinct active guests by market (post-join)

Table: bookings(booking_id, market_id, guest_id, is_cancelled) (one row per booking). Return each market_id and COUNT(DISTINCT guest_id) among non-cancelled bookings only, ordered by that count descending.

Solution using COUNT(DISTINCT …) and WHERE

```sql
SELECT market_id, COUNT(DISTINCT guest_id) AS active_guests
FROM bookings
WHERE NOT is_cancelled
GROUP BY market_id
ORDER BY active_guests DESC;
```

Why this works: WHERE restricts to completed rows before grouping; COUNT(DISTINCT guest_id) counts people, not booking rows, at market grain. If a guest has two completed bookings in the same market, they still count once.

Practice

When you want SQL by difficulty on the same hub, open Airbnb SQL (all difficulties) or Airbnb easy SQL to warm up.


2. Joins, anti-joins, and deduplication in SQL

Joins and deduplication in SQL for data engineering

A join answers: “For each row in A, which row(s) in B match the ON condition?” In marketplace analytics, you join facts (bookings, page views, impressions) to dimensions (listing, market, user profile) to get attributes for reporting. You are evaluated on: inner vs outer use, anti-join for “has no …”, and fan-out when a key is not unique on one side.

Fact vs dimension and grain (how interviewers want you to talk about joins):

  • Fact: many rows, often one row per event or booking—the thing you SUM or COUNT.
  • Dimension: one row per entity (in an ideal, clean model)—listing, host, city—enriched via the join. If a dimension is not unique (duplicate listing_id rows), a join to facts can multiply rows; summing a dimension’s attribute then inflates the metric.

Diagram of SQL join fan-out (row multiplication) versus deduplicated group-by for Airbnb data engineering interview prep on PipeCode.

Inner join

Result: Only rows that satisfy the ON on both sides. Unmatched keys disappear.

Worked example: bookings inner join listings on listing_id: if a booking’s listing_id was deleted in source and is missing in listings, that booking vanishes from the result. If the question is “revenue for known listings,” that may be right; if it is “all revenue, including orphan bookings,” you need a left join or a fuzzy-key strategy; state your assumption out loud.

Left join and the anti-join pattern

LEFT JOIN returns all left-side rows, padding right columns with NULL when there is no match. Anti-join: LEFT JOIN … WHERE right.not_null_col IS NULL keeps only left rows with no partner in B. Classic: “Guests who have never completed a booking.”

Worked example (mini): guests ids G1, G2; bookings has G1 only. FROM guests g LEFT JOIN bookings b ON g.id = b.guest_id WHERE b.booking_id IS NULL returns G2 only.

Semi-join: EXISTS / IN (keep left rows, do not add B’s columns)

“Listings that had at least one booking in NYC” is often a filter on listings, not a full merge. WHERE EXISTS (SELECT 1 FROM bookings … WHERE … listing_id = listings.id) keeps one row per listing; compare to joining all bookings, then DISTINCT, which is heavier and easier to get wrong on grain.

Fan-out: when one listing matches many booking rows

Bug pattern: SUM(dimension.accommodates) after joining bookings → listings with one row per booking adds the same accommodates value once per booking, which is not “total capacity in the market” but “sum of capacity over booking rows.” Fixes: SUM at listing grain in a subquery, MAX/MIN of the attribute per listing, or COUNT(DISTINCT listing_id)-style questions instead of a blind sum.

Worked example:

| guest_id (dim) | credit_score |
|---|---|
| 1 | 720 |

| booking_id | guest_id |
|---|---|
| b1 | 1 |
| b2 | 1 |

JOIN gives two rows; SUM(credit_score) = 1440 if you are not careful, which is wrong for “system-wide average credit of guests with bookings.” Use AVG per guest in a CTE, or MAX(credit_score) if one value per guest_id is intended in the result.
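The whole bug fits in a few lines; a sqlite3 sketch with the toy tables above (one guest with score 720, two bookings):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE guests (guest_id INTEGER, credit_score INTEGER)")
conn.execute("CREATE TABLE bookings (booking_id TEXT, guest_id INTEGER)")
conn.execute("INSERT INTO guests VALUES (1, 720)")
conn.executemany("INSERT INTO bookings VALUES (?, ?)", [("b1", 1), ("b2", 1)])

# Naive: the dimension attribute repeats once per booking row after the join.
(naive,) = conn.execute("""
    SELECT SUM(g.credit_score)
    FROM bookings b JOIN guests g ON g.guest_id = b.guest_id
""").fetchone()

# Fix: collapse to one row per guest before aggregating.
(fixed,) = conn.execute("""
    SELECT AVG(credit_score) FROM (
        SELECT g.guest_id, MAX(g.credit_score) AS credit_score
        FROM bookings b JOIN guests g ON g.guest_id = b.guest_id
        GROUP BY g.guest_id
    )
""").fetchone()
print(naive, fixed)  # 1440 720.0
```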

Deduplication: ROW_NUMBER vs DISTINCT ON

“Latest base_price per listing” in a type-2-ish history table: prefer ROW_NUMBER() OVER (PARTITION BY listing_id ORDER BY as_of_date DESC) AS rn then WHERE rn = 1 in an outer filter so you control ties. DISTINCT ON (listing_id) … ORDER BY … (PostgreSQL) is compact; tie semantics must match the business rule (e.g. break ties on ingest_id).

Worked example (pattern only):

```sql
SELECT listing_id, base_price, as_of_date
FROM (
  SELECT listing_id, base_price, as_of_date,
         ROW_NUMBER() OVER (
           PARTITION BY listing_id ORDER BY as_of_date DESC, ingest_id DESC
         ) AS rn
  FROM listing_price_history
) t
WHERE rn = 1;
```

Common beginner mistakes

  • Joining on NULL = NULL: not true; rows do not match—use COALESCE or explicit IS NULL logic only if the spec allows it.
  • Inner join when the question needs kept empty sides (left / outer).
  • Aggregating at join-inflated grain without a pre-aggregate or distinct step.

Joins-based interview question on O–D pair volume

Tables: bookings(booking_id, origin_city, dest_city, amount_usd, is_cancelled). Return each ordered pair (origin_city, dest_city) with count of non-cancelled bookings, excluding pairs where count < 2. Order by count descending.

Solution using GROUP BY and HAVING

```sql
SELECT origin_city, dest_city, COUNT(*) AS n
FROM bookings
WHERE NOT is_cancelled
GROUP BY origin_city, dest_city
HAVING COUNT(*) >= 2
ORDER BY n DESC;
```

Why this works: Grain = (origin, dest); WHERE filters rows; HAVING enforces the “at least two completed bookings” rule on the group.

Joins-based interview question on guests with no completed booking

Tables: guests(guest_id, email), bookings(booking_id, guest_id, status) with status in ('pending','confirmed','cancelled'). Return guest_id and email for guests with no row where status = 'confirmed'.

Solution using LEFT JOIN and NOT EXISTS (two valid patterns)

Option A — anti-join:

```sql
SELECT g.guest_id, g.email
FROM guests g
LEFT JOIN bookings b
  ON b.guest_id = g.guest_id AND b.status = 'confirmed'
WHERE b.booking_id IS NULL;
```

Option B — NOT EXISTS:

```sql
SELECT g.guest_id, g.email
FROM guests g
WHERE NOT EXISTS (
  SELECT 1
  FROM bookings b
  WHERE b.guest_id = g.guest_id AND b.status = 'confirmed'
);
```

Why this works: Option A: an unmatched LEFT JOIN row gets a NULL booking_id; we keep those. Option B: set membership; often clearer when bookings can have multiple conditions. Both are standard anti-join / semi-join patterns interviewers like you to name explicitly.

Practice

Airbnb-tagged join drills (easy) are a good next step.


3. Window functions and ranking in SQL

Window functions and ranking in SQL for travel and marketplace data

GROUP BY collapses many rows to one per group. Window functions keep every input row and attach a value computed over a window of rows—usually “all rows in the same partition that sort before or with this row.” In travel and listing data, that is how you express first booking to a city, top-N hosts by neighborhood, running revenue, or gaps between events.

Parts of a window: PARTITION BY (optional) splits the table into independent groups—e.g. per guest_id or per (market_id, listing_id). ORDER BY inside OVER (…) defines order within that partition. ROWS BETWEEN … (optional) bounds the frame for running sums (not always needed in interviews if the default frame is correct).

ROW_NUMBER, RANK, and DENSE_RANK

All need ORDER BY in OVER (for meaningful output).

  • ROW_NUMBER(): Unique integers 1..n in the partition after sort—no ties; if two rows tie on the sort key, the engine picks a stable order unless you add tie-break columns (e.g. guest_id ASC).
  • RANK(): Tied rows get the same rank; next rank skips (1, 2, 2, 4).
  • DENSE_RANK(): Tied rows get the same rank; no gaps (1, 2, 2, 3).

Worked example — three trips in one market_id by amt_usd desc:

| guest_id | amt_usd |
|---|---|
| 10 | 500 |
| 20 | 400 |
| 30 | 400 |

ROW_NUMBER might be 1, 2, 3 (with tie-break on guest_id). RANK is 1, 2, 2; DENSE_RANK is 1, 2, 2. Use ROW_NUMBER when the problem says “pick one winner per market” with a clear tie-break; use DENSE_RANK when you want “top 3 price tiers” (ties share a tier).
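The tie behavior can be checked directly; a sqlite3 sketch with the toy data above (window functions need SQLite 3.25+, which recent Python builds bundle):

```python
import sqlite3

# Three trips in one market; guests 20 and 30 tie at 400.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spend (guest_id INTEGER, amt_usd INTEGER)")
conn.executemany("INSERT INTO spend VALUES (?, ?)",
                 [(10, 500), (20, 400), (30, 400)])

rows = conn.execute("""
    SELECT guest_id,
           ROW_NUMBER() OVER (ORDER BY amt_usd DESC, guest_id) AS rnum,
           RANK()       OVER (ORDER BY amt_usd DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY amt_usd DESC) AS drnk
    FROM spend
    ORDER BY rnum
""").fetchall()
for r in rows:
    print(r)  # (10, 1, 1, 1), (20, 2, 2, 2), (30, 3, 2, 2)
```

Note the explicit guest_id tie-break in the ROW_NUMBER sort; without it the 2 vs 3 assignment for the tied rows would be engine-dependent.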

LAG and LEAD for time gaps and prior values

  • LAG(col, 1) = previous row’s col in partition order; first row in partition is NULL.
  • LEAD(col, 1) = next row’s col (useful for “time to next booking”).

Worked example (dates): LAG(order_date) per guest_id in order → gap in days = order_date - LAG (for DATE type in PostgreSQL-style SQL, difference is an integer day count; for timestamps, use EPOCH difference / 60 for minutes in session rules).

Running totals (optional pattern)

Cumulative SUM(amount) OVER (PARTITION BY market_id ORDER BY booking_ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) = running total per market in time order, the same idea as “year-to-date” metrics with a clear ORDER BY.
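A minimal running-total sketch in sqlite3, with an illustrative bookings table and the same frame clause:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (market_id TEXT, booking_ts TEXT, amount REAL)")
conn.executemany("INSERT INTO bookings VALUES (?, ?, ?)", [
    ("M1", "2024-01-01", 100), ("M1", "2024-01-02", 50), ("M1", "2024-01-03", 25),
])

# Running total per market in time order.
rows = conn.execute("""
    SELECT booking_ts,
           SUM(amount) OVER (
               PARTITION BY market_id ORDER BY booking_ts
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running
    FROM bookings
    ORDER BY booking_ts
""").fetchall()
print(rows)  # [('2024-01-01', 100.0), ('2024-01-02', 150.0), ('2024-01-03', 175.0)]
```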

Common beginner mistakes

  • Forgetting ORDER BY in OVER for ROW_NUMBER / RANK / LAG, creating nondeterministic or meaningless output.
  • Using RANK = 1 when two rows tie and the prompt still wants a unique “winner”—add a tie-break to ORDER BY and use ROW_NUMBER, or get clarification.

SQL interview question on top spender per market

Table: spend(guest_id, market_id, amt_usd), assumed here to have one row per (guest, market); if multiple rows per pair are possible, clarify with the interviewer. Return the guest_id with the highest amt_usd in each market_id (ties on amount: pick the lowest guest_id).

Solution using ROW_NUMBER

```sql
SELECT guest_id, market_id, amt_usd
FROM (
  SELECT guest_id, market_id, amt_usd,
         ROW_NUMBER() OVER (
           PARTITION BY market_id ORDER BY amt_usd DESC, guest_id ASC
         ) AS rn
  FROM spend
) t
WHERE rn = 1;
```

Why this works: rn = 1 is exactly one row per market_id; guest_id ASC breaks ties on amt_usd as specified.

SQL interview question on guests whose second booking is to a given city (pattern with LAG)

Table: bookings(guest_id, booking_seq, dest_city, booking_ts) with booking_seq 1, 2, 3, … in true chronological order per guest (simplified). Return guest_id for guests whose second booking has dest_city = 'NYC'.

Solution using LAG to compare sequence

If booking_seq is 1-based and consecutive, a simpler form is: WHERE booking_seq = 2 AND dest_city = 'NYC'. The window version (when you only have booking_ts, not a sequence number):

```sql
SELECT guest_id
FROM (
  SELECT guest_id, dest_city,
         ROW_NUMBER() OVER (PARTITION BY guest_id ORDER BY booking_ts) AS rn
  FROM bookings
) t
WHERE rn = 2 AND dest_city = 'NYC';
```

Why this works: ROW_NUMBER over ORDER BY booking_ts labels the 1st, 2nd, 3rd trip per guest; we filter the 2nd row. If the question instead needs LAG, LAG(dest_city) OVER (PARTITION BY guest_id ORDER BY booking_ts) on the third event row gives the second destination only if the grain is one row per leg; match the ERD to the business rule.

Practice


4. Sessionization and time gaps in event SQL

Event gaps, sessions, and half-open time ranges in SQL

Sessionization questions ask you to group events into sessions when silence exceeds a threshold—classic for web, app, and search logs. A common rule: start a new session if the time since the previous event in the same partition (e.g. user_id) is strictly more than 30 minutes (confirm the threshold and strict vs. non-strict comparison in the spec).

Three-step set-based pattern:

  1. LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts) to align each event with the previous.
  2. is_start: 1 if no previous, else 1 if the gap (seconds/minutes) > threshold, else 0.
  3. session_id: SUM(is_start) OVER (PARTITION BY user_id ORDER BY event_ts)—a running count of session starts, which labels each row with a monotonic session id within the user (1,1,1,2,2,… in order of time).

Half-open reporting windows on dates/timestamps: use [start, end) (see §1) so you do not double count events at a midnight boundary in mixed DATE/TIMESTAMP stores.

Worked example — gap flags and cumulative id:

| user_id | ts | gap from prev | is_start | session_id (cumulative) |
|---|---|---|---|---|
| u1 | 10:00 | NULL | 1 | 1 |
| u1 | 10:10 | 10m | 0 | 1 |
| u1 | 11:00 | 50m | 1 | 2 |
(With 30-minute inactivity to start a new session.)
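The three-step pattern has a direct one-pass Python analogue, handy for narrating the logic; a sketch assuming one user's events arrive sorted by timestamp:

```python
from datetime import datetime, timedelta

THRESHOLD = timedelta(minutes=30)  # inactivity rule; confirm the spec's threshold

def sessionize(timestamps: list[datetime]) -> list[int]:
    """Label each event with a session id (1, 2, ...) for one user."""
    session_ids: list[int] = []
    prev = None  # step 1: the LAG equivalent
    sid = 0      # step 3: running count of session starts
    for ts in timestamps:
        if prev is None or ts - prev > THRESHOLD:  # step 2: is_start
            sid += 1
        session_ids.append(sid)
        prev = ts
    return session_ids

u1 = [datetime(2024, 6, 1, 10, 0), datetime(2024, 6, 1, 10, 10),
      datetime(2024, 6, 1, 11, 0)]
print(sessionize(u1))  # [1, 1, 2]: the 50-minute gap opens session 2
```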

LAG and a gap predicate

After prev_ts is in the row, “is new session?” is prev_ts IS NULL OR (event_ts - prev_ts) > interval '30 minutes' (dialect-specific).

Full session_id sketch (PostgreSQL-style; illustrative)

Worked example:

```sql
WITH t AS (
  SELECT user_id, event_id, event_ts,
         LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts) AS prev_ts
  FROM events
),
f AS (
  SELECT *,
         CASE
           WHEN prev_ts IS NULL THEN 1
           WHEN event_ts - prev_ts > INTERVAL '30 minutes' THEN 1
           ELSE 0
         END AS is_start
  FROM t
)
SELECT user_id, event_id, event_ts,
       SUM(is_start) OVER (PARTITION BY user_id ORDER BY event_ts) AS session_id
FROM f;
```

Why this works: is_start fires at each new session boundary; the prefix sum in time order is the session counter for that user_id.

Common beginner mistakes

  • Time zone drift between ingest and display → fix a convention in the answer.
  • Wrong PARTITION BY (e.g. session across users).
  • Reusing a global session_id without resetting per day when the spec says “sessions that day”—filter events to the day before you count distinct session labels in that range.

SQL interview question on orders with gap > 7 days since previous (pattern)

Table: orders(order_id, guest_id, order_date) (DATE). Return order_id where the day gap from this guest’s previous order_date (by order_date ascending) is > 7; drop a guest’s first order (no predecessor).

Solution using LAG

```sql
SELECT order_id, guest_id, order_date, prev_d,
       order_date - prev_d AS gap_days
FROM (
  SELECT order_id, guest_id, order_date,
         LAG(order_date) OVER (
           PARTITION BY guest_id ORDER BY order_date, order_id
         ) AS prev_d
  FROM orders
) t
WHERE prev_d IS NOT NULL
  AND order_date - prev_d > 7;
```

Why this works: LAG + prev_d IS NOT NULL (first order removed); for app-event gaps in minutes, switch to TIMESTAMP and EPOCH difference.

SQL interview question: sessions per user on one calendar day (outline + solution)

Table: events(user_id, event_ts); 30-minute rule. Count of distinct sessions per user_id for event_ts on 2024-06-01 (UTC; state the assumption). (Use the CTE pattern above, then WHERE event_ts >= … AND event_ts < next day, GROUP BY user_id, COUNT(DISTINCT session_id) or equivalent.)

Solution (pipeline outline)

  1. Compute session_id for all events (full sketch).
  2. WHERE event_ts >= TIMESTAMP '2024-06-01 00:00:00' AND event_ts < TIMESTAMP '2024-06-02 00:00:00'.
  3. SELECT user_id, COUNT(DISTINCT session_id) … GROUP BY user_id over the filtered rows; session ids are numbered within each user here, so a per-user distinct count is safe. MAX(session_id) - MIN(session_id) + 1 is fragile (it assumes the ids present in the window are contiguous), and globally unique ids would need a concatenated or tuple key. (Interview credit is for the LAG + is_start + prefix-sum pattern, not the distinct bookkeeping.)

Why this works: The count of logical session segments in the time window is what you need; match your id construction to a per-user or per-day reset if the spec requires it.

Practice

Medium-difficulty SQL on Airbnb often mixes windows with time logic.


5. Subqueries, CASE, and market-style rollups in SQL

CASE, filtered aggregates, and subqueries in SQL

Market or “pulse” reports usually slice a fact table (bookings, listings) by market_id, category, or date, and ask for multiple metrics in one query—conditional counts/sums are the workhorse. You express “only confirmed” or “nights at least 7” either with WHERE (if the whole row is in/out of the population) or with SUM(CASE …) / COUNT(*) FILTER (WHERE …) when the same SELECT must output both all-row counts and restricted metrics.

Searched CASE in aggregates

Idea: Build a 0/1 or NULL flag inside the aggregate. COUNT(*) still counts rows; SUM(CASE WHEN condition THEN 1 END) counts rows where the condition is true and SUM skips NULL.

Worked example (logical): In one GROUP BY market_id, you can output: total rows, SUM(CASE WHEN status = 'confirmed' THEN 1 END), and SUM(CASE WHEN status = 'confirmed' AND nights >= 7 THEN 1 END) in one pass.

FILTER (PostgreSQL) vs CASE

COUNT(*) FILTER (WHERE status = 'confirmed') is often clearer than nested SUM(CASE…); both compile to the same class of plan in many optimizers. If the engine has no FILTER, translate mechanically to SUM(CASE WHEN … THEN 1 ELSE 0 END).

Ratios, percentages, and NULLIF for safe division

Worked example (pattern):

SUM(CASE WHEN is_click THEN 1.0 ELSE 0 END) / NULLIF(SUM(CASE WHEN is_impression THEN 1.0 ELSE 0 END), 0) for CTR at (campaign, user) grain → dedupe impressions/clicks to one row per pair first if the spec says so, or the denominator will not match the business (see §2 fan-out).
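A small sqlite3 sketch of the safe-division guard, over a hypothetical pre-aggregated funnel table (SQLite happens to return NULL on division by zero, but engines like PostgreSQL raise an error, which is why the portable NULLIF guard matters):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE funnel (campaign TEXT, clicks INTEGER, impressions INTEGER)")
conn.executemany("INSERT INTO funnel VALUES (?, ?, ?)",
                 [("c1", 5, 100), ("c2", 0, 0)])

# NULLIF turns a zero denominator into NULL, so the ratio is NULL instead of an error;
# * 1.0 forces float division.
rows = conn.execute("""
    SELECT campaign, clicks * 1.0 / NULLIF(impressions, 0) AS ctr
    FROM funnel
    ORDER BY campaign
""").fetchall()
print(rows)  # [('c1', 0.05), ('c2', None)]
```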

CTEs to fix grain before a join

Idea: “Markets with > 5 active listings” may need FROM (SELECT market_id, COUNT(*) AS n FROM listings WHERE is_active GROUP BY market_id) t WHERE n > 5 or a CTE with the same. CTEs are for readability; the key is to aggregate in an inner query and filter in an outer query when the predicate is on the aggregate, not a row flag.

Common beginner mistakes

  • Integer division in a/b when both are integers; use * 1.0, CAST(… AS FLOAT), or NUMERIC(12,4).
  • CTR at impression grain vs (user, campaign) grain → match the dedupe rule to the denominator definition.
  • WHERE on SUM(…) at the same level as the raw FROM → illegal in standard SQL; use HAVING or a subquery/CTE.

SQL interview question on tiered market counts

Table: bookings(listing_id, market_id, nights, status); status in ('pending','confirmed','cancelled'). Return each market_id, the count of confirmed bookings, and the count of confirmed bookings with nights >= 7.

Solution using conditional counts

```sql
SELECT market_id,
       COUNT(*) FILTER (WHERE status = 'confirmed') AS n_conf,
       COUNT(*) FILTER (WHERE status = 'confirmed' AND nights >= 7) AS n_long_stay
FROM bookings
GROUP BY market_id;
```

Why this works: One pass per market_id; FILTER is the readable form of two independent filters on the confirmed population for the long-stay column.

SQL interview question: markets with high share of long-stay (rate + HAVING)

Table: same as above. Return market_id where among confirmed bookings, at least 20% have nights >= 7, and there are at least 10 confirmed bookings in that market.

Solution using HAVING on a ratio

```sql
SELECT market_id
FROM bookings
WHERE status = 'confirmed'
GROUP BY market_id
HAVING COUNT(*) >= 10
   AND SUM(CASE WHEN nights >= 7 THEN 1 ELSE 0 END) * 1.0 / NULLIF(COUNT(*), 0) >= 0.2;
```

Why this works: Confirmed-only base comes from WHERE status = 'confirmed'. The numerator is long-stay count; denominator is all confirmed rows in the group; HAVING enforces both min volume and min share. Use 1.0 to avoid integer division in engines that are picky.

Practice


6. Python: arrays, intervals, graphs, and pandas for DE screens

Data structures, sweep lines, and tabular code in Python for data engineering

Python screens for data engineering (including marketplace-style Airbnb sets) are usually O(n log n) or O(n): you hash keys, sort for intervals, or BFS/DFS on a small graph. You are not expected to re-prove theorems, but to name complexity, avoid nested scans over the whole list per element, and match the grain of pandas merges to join rules in SQL.

set, dict, and defaultdict for grouping and membership

  • set: O(1) average in and add → first duplicate, “seen,” uniques.
  • dict / defaultdict: aggregate in one pass by key—same as GROUP BY in SQL. defaultdict(int) avoids KeyError on +=.

Worked example — frequency: ['a','b','a'] → Counter or dict gives a:2, b:1.

Two pointers (sorted arrays) and one-pass first duplicate

Two pointers often appear in merged or intersection problems after sorting. First duplicate in read order is not two-pointer; it is one left-to-right with a set.

Worked example — merge two sorted unique lists (sketch): i, j from 0, advance the smaller, append to out—O(n+m) time, O(1) extra if output can be in place.
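The sketch above as runnable code; an O(n + m) two-pointer merge of two sorted lists:

```python
def merge_sorted(a: list[int], b: list[int]) -> list[int]:
    """Merge two sorted lists with two pointers in O(n + m)."""
    out: list[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:          # advance the pointer at the smaller value
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])             # at most one of these tails is non-empty
    out.extend(b[j:])
    return out

print(merge_sorted([1, 4, 9], [2, 3, 10]))  # [1, 2, 3, 4, 9, 10]
```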

Intervals, sorting by end, and “non-overlapping” / sweep ideas

Many “booking” or reservation problems are intervals [start, end). Sorted by end time, a greedy that takes an interval if its start is after the last end is the classic “max number of non-overlapping” pattern (or min rooms with a min-heap of end times for concurrent count; know which variant the prompt is asking for).

Worked example (intervals, logical): [(1,3),(2,4),(3,5)] — if inclusive overlap at a point, 3 are concurrent at t=3; a sweep counts +1 on start events, -1 on end; max of prefix is concurrent count.
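The sweep as code; a sketch that treats interval ends as inclusive to match the t=3 example, so start events sort before end events at the same instant:

```python
def max_concurrent(intervals: list[tuple[int, int]]) -> int:
    """Sweep line: +1 at each start, -1 at each end; max prefix = peak concurrency."""
    events: list[tuple[int, int]] = []
    for s, e in intervals:
        events.append((s, 1))
        events.append((e, -1))
    # Inclusive ends: on a tie, process the +1 (start) before the -1 (end).
    events.sort(key=lambda ev: (ev[0], -ev[1]))
    cur = peak = 0
    for _, delta in events:
        cur += delta
        peak = max(peak, cur)
    return peak

print(max_concurrent([(1, 3), (2, 4), (3, 5)]))  # 3: all three overlap at t=3
```

Flipping the sort so ends process first gives half-open [start, end) semantics instead; say which one the prompt means.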

Trees and graphs (inheritance, category trees)

Parent pointer to root or adj list for BFS/DFS from a source; O(V+E). Inheritance of listing attributes in a tree is often: walk up to root until a set value is found (with path or cache in harder variants).

Worked example (verbal): parent[child] = p in a dict; walk from leaf to root in O(depth).

pandas: merge and groupby (and fan-out, again)

pd.merge(orders, dim, on='listing_id', how='left') then g = merged.groupby('market_id')['amount'].sum() is “SQL join + GROUP BY” in memory. If dim has duplicate listing_id rows, sum of a dimension column can double-count (same fan-out story as §2).

Worked example (conceptual pandas): After a one-to-one listing dimension, merge(...).groupby('market_id', as_index=False)['gross_nights'].sum() matches per-market grain.

Python pattern chips (arrays, sweep line, graph, pandas) for Airbnb data engineering interview prep on PipeCode.

Common beginner mistakes

  • O(n²) from a loop that runs x in list membership tests against a large list; precompute a set of allowed ids.
  • Off-by-one on inclusive vs half-open interval ends in overlap checks.
  • pandas merge on non-unique right keys and summing a duplicated column from the right.

Python interview question on first duplicate in order ids

Input: ids: list[int]. Return the first value that appears twice in left-to-right order, or None if there is no duplicate.

Solution using a set

```python
def first_dup(ids: list[int]) -> int | None:
    seen: set[int] = set()
    for x in ids:
        if x in seen:
            return x
        seen.add(x)
    return None
```

Why this works: O(n) time, O(n) space; the first time you see a value for the second time, return it.

Python interview question on merge quantities by key

Input: two lists of dicts a and b, each with keys sku, qty. Return one dict sku -> total qty.

Solution using defaultdict

```python
from collections import defaultdict

def merge(a: list[dict], b: list[dict]) -> dict[str, int]:
    t: dict[str, int] = defaultdict(int)
    for row in a + b:
        t[row["sku"]] += row["qty"]
    return dict(t)
```

Why this works: One pass; same pattern as GROUP BY sku, SUM(qty).

Python interview question on maximum number of non-overlapping stay intervals (greedy by end time)

Input: intervals: list[tuple[int, int]] as (start, end) on a discrete day line; treat non-overlapping as end < next start (equivalently half-open [start, end) in continuous time). Return the maximum number of pairwise non-overlapping intervals (classic greedy: sort by end, then take the next interval whose start is strictly after the last chosen end).

Solution: sort by end, greedy count

```python
def max_non_overlapping(intervals: list[tuple[int, int]]) -> int:
    if not intervals:
        return 0
    intervals = sorted(intervals, key=lambda x: x[1])
    last_end = float("-inf")
    n = 0
    for s, e in intervals:
        if s > last_end:
            n += 1
            last_end = e
    return n
```

Why this works: Earliest-end-first greedy is optimal for max count of non-overlapping intervals on a line when no overlap means start > previous end (adjust to >= if the spec shares boundary days). Confirm inclusive vs exclusive end with the interviewer.

Practice

For graph-indexed practice on the same hub, use Airbnb-tagged graph. The full language filter is Airbnb Python; Python medium on Airbnb matches many longer coding prompts.

General (non–company) drills: Sweep line and graph topic pages—use if you are weak on the shapes that show up in harder prompts.


7. Star schema, rental facts, and SCDs in data engineering

Dimensional modeling and SCDs for warehouse-style questions

Star schema is the default whiteboard structure: a fact table of measures at a clear grain + foreign keys to dimension tables that hold descriptive attributes. For short-term rental (STR) analytics, your fact might be booking-level, night-level, or daily listing-level—state the grain first in any design answer.

Star vs snowflake (one sentence each)

  • Star: dimensions are denormalized into wide tables (e.g. city name on dim_listing); fewer joins, more redundancy.
  • Snowflake: dimensions are normalized (e.g. city in dim_geo); more joins, less duplicate text. Tradeoff: query simplicity vs storage / consistency of lookup tables.

Fact grain for STR: booking vs nightly

Worked example (verbal): One row per booking → natural for gross booking value, cancel flags, host payout in one row. One row per (listing, night) in an occupancy calendar → natural for nightly rate × nights, blocked calendar, and revenue when each night is priced. Do not put night-level and booking-level facts in the same table without a bridge or a derived ETL—interviewers will press you on that.
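Deriving the nightly grain from a booking row is a one-function ETL step; a minimal sketch, where `explode_nights` is an illustrative helper name and the half-open convention is assumed:

```python
from datetime import date, timedelta

def explode_nights(listing_id: str, check_in: date, check_out: date):
    """Derive (listing, night) rows from one booking-grain row.

    Half-open [check_in, check_out): the checkout day is not a billed
    night, so a Jun 1 -> Jun 4 stay yields exactly three night rows.
    """
    d = check_in
    while d < check_out:
        yield (listing_id, d.isoformat())
        d += timedelta(days=1)

rows = list(explode_nights("L7", date(2024, 6, 1), date(2024, 6, 4)))
print(rows)  # [('L7', '2024-06-01'), ('L7', '2024-06-02'), ('L7', '2024-06-03')]
```

This is the "derived ETL" in practice: the booking fact stays canonical, and the night-level fact is produced from it rather than mixed into it.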

Surrogate keys (listing_sk, guest_sk): The warehouse uses integers that do not change when a source system merges accounts; facts store listing_sk, not raw PMS ids when dimensions are conformed.

SCD Type 1 vs Type 2 for listing attributes

  • Type 1 — overwrite: one row per natural listing; fix typos or only care about current city—no history.
  • Type 2 — new row for change: new listing_sk (or new row in dim_listing with valid_from / valid_to / is_current) when bedrooms or city change; as-of reporting joins the version that was valid on the event date.

Worked example (table) — one natural listing, two SCD2 rows:

| listing_sk | natural_id | city | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| 1001 | L7 | NYC | 2020-01-01 | 2024-12-31 | false |
| 1002 | L7 | MIA | 2025-01-01 | NULL | true |

A 2024 booking should join listing_sk = 1001; a current-listings report uses listing_sk = 1002.

Bridge and degenerate dimensions (naming only)

  • Bridge if you have many-to-many (e.g. amenity flags → bridge_listing_amenity).
  • Degenerate dims in the fact: booking_id on a line-like fact is often stored in the fact and not in a separate 1:1 table.

Star schema and SCD Type 1 vs 2 for listings and host attributes in Airbnb data modeling interview prep on PipeCode.

Common beginner mistakes

  • Two grains mixed in one fact table without documentation or lineage back to one canonical fact.
  • Type 1 overwrites for geography when "revenue in 2023 by the old city" is still a required report.
  • Forgetting late-arriving bookings when discussing idempotent loads (even verbally).

Data engineering interview question on “current” vs historical listing

Explain how to store listing attributes (e.g. city, accommodates) so a 2024 revenue report for “NYC” is stable when a listing moves to another city in 2025.

Solution using SCD2 on listing dimension (verbal + sketch)

Answer outline: SCD2 rows on dim_listing with valid_from, valid_to (or is_current); fact rows for nightly or booking revenue reference listing_sk active at fact date. NYC 2024 joins to the version valid for each date, not the 2025 row.

Why this works: As-of reporting stays correct; for “current” listing rows, analysts filter on is_current, while historical facts keep their old listing_sk keys.

Data engineering interview question: one fact for nightly revenue vs one for bookings (verbal)

Prompt: We need both (a) revenue by booking for cancellations and (b) revenue by night for calendar analytics. One table or two?

Solution (verbal + sketch)

  • Prefer at least two facts (or one fact at nightly grain plus a booking fact at booking grain): (a) fact_booking one row per booking for gross booking value, status, payout line items. (b) fact_night one row per (booking, night) or (listing, night) for nightly rate, promo allocation, and night-level experiments. Conformed dim_listing and date make rollups consistent; a data mart can pre-join for analysts if needed.

Why this works: Different grains → different facts (or a nightly bridge with very explicit rules). Stating the grain in one clear sentence is often the scoring move.

Practice

Use the Airbnb hub to pair modeling study with hard-difficulty problems that mention data warehouse or dimensional modeling. PipeCode also offers interview-first courses; see Data modeling for data engineering interviews and ETL system design for data engineering interviews.


Tips to crack Airbnb data engineering interviews

Data engineering interview preparation for a marketplace and travel company like Airbnb means being fluent in SQL grain and windows, in time and session questions, in practical Python (not only CRUD), and in a credible warehouse story on dimensions and SCDs.

  1. Re-read the 20-problem mix on the Airbnb company page: it skews medium SQL, with graph and data modeling at the hard end of the set—plan reps accordingly.
  2. For SQL, always say one sentence of grain before you write FROM. Name NULL and join cardinality assumptions.
  3. For sessionization and windows, practice LAG / LEAD and ROW_NUMBER on paper until the partition and order are automatic.
  4. For Python, get comfortable with dict/defaultdict, heap/sort+sweep for intervals, and read-only graph walks; add pandas merges if your loop historically uses them.
  5. For modeling, memorize one tight example of SCD2 with as-of joins; link it to nightly vs booking facts for STR (short-term rental) metrics.
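The sessionization pattern from tip 3 can be sketched end-to-end with window functions in SQLite (3.25+, bundled with recent Pythons); the `events` table, minute-granularity timestamps, and 30-minute cutoff are all illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, ts INTEGER)")  # ts in epoch minutes
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("u1", 0), ("u1", 10), ("u1", 50), ("u1", 55), ("u2", 5)])

# Flag a new session when the gap from the previous event exceeds 30 minutes,
# then number sessions as a running sum of those flags per user.
rows = conn.execute("""
WITH flagged AS (
  SELECT user_id, ts,
         CASE WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) > 30
              THEN 1 ELSE 0 END AS new_session
  FROM events
)
SELECT user_id, ts,
       SUM(new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_id
FROM flagged
ORDER BY user_id, ts
""").fetchall()
print(rows)
```

Note the first event per user gets flag 0 (the CASE's NULL comparison falls through to ELSE), so session ids start at 0; say which convention you are using before you write the query.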

Across the platform, 450+ problems with tests and AI feedback are available; use topic and company filters to stay close to the skills above.


Frequently asked questions

What is the Airbnb data engineering interview process?

Rounds often include a recruiter or hiring manager screen, one or more technical sessions (SQL and Python are common), data-centric system or pipeline design, and behavioral interviews. Exact steps vary by level and org—treat the curated Airbnb problem list as pattern prep, not a guarantee of on-site order.

What topics should I study for an Airbnb data engineering interview?

SQL: joins, aggregations, window functions, subqueries, deduplication, and time and session logic. Python: arrays and hashes, intervals (including sweep line), graphs for hierarchical data, and sometimes pandas. Modeling: star schema, facts at clear grain, and SCDs for listing-style change. This guide maps each area to practice links—see the sections and table above.

How difficult are Airbnb data engineering interview questions?

A representative 20-problem company set is often tiered as 6 easy, 11 medium, and 3 hard; medium SQL and join work is the spine, and hard items may be data warehouse design or non-trivial graph or DP-style Python depending on the pack. Calibrate with easy vs medium on the same hub.

How should I prepare for an Airbnb data engineering interview?

Solve a mix of company-tagged and global topic problems, verbalize your assumptions, and revisit every miss with a one-paragraph post-mortem (what grain you chose and why). Aim for dozens of reps across SQL and Python before an onsite-style day—not a single cram night.

What programming languages does Airbnb test in data engineering interviews?

SQL and Python are the main language filters in many DE loops; some teams add verbal modeling or system design without a new language. For reps that match, start from Airbnb SQL and Airbnb Python on PipeCode.

Where can I practice Airbnb-style data engineering questions with tests and feedback?

PipeCode offers 450+ problems with automated tests and feedback; the company page for Airbnb is /explore/practice/company/airbnb. You can also browse all company hubs to compare volume and topics across employers.

Start practicing Airbnb data engineering problems

