Airbnb data engineering interview questions lean on SQL for marketplace-style analytics (aggregations, joins, window functions, and sessionization from event streams) and Python for data structures, intervals, and occasional graph or pandas-style coding. Deeper rounds often add verbal dimensional modeling and data warehouse design (star schema, SCDs, booking grain).
Everything below uses original teaching examples only—not proprietary interview items. The guide is structured for data engineering interview prep: worked examples in SQL and Python, a data modeling section, and topic-aligned practice via PipeCode’s Airbnb company hub. Narrate grain, join semantics, and NULL behavior as you go; interviewers care about the reasoning, not just the final answer.
Top Airbnb data engineering interview topics
| # | Topic | Why it shows up |
|---|---|---|
| 1 | Aggregation & GROUP BY / HAVING | Nightly price by market, booking counts, O–D (origin–destination) pairs, listing metrics. |
| 2 | Joins & anti-joins | Listings to bookings, page events to sessions, avoiding fan-out. |
| 3 | Window functions | "First" booking, distinct visitors per scope, LAG / LEAD, ranking. |
| 4 | Gaps, sessions & time | Session ids from inactivity (e.g. 30+ minutes), half-open time ranges. |
| 5 | Subqueries, CASE, market-style reports | Filtered rollups, market pulse, conditional metrics. |
| 6 | Python: arrays, intervals, graphs, pandas | Max revenue from bookings, multi-dimensional iterators, attribute inheritance, merges. |
| 7 | Star schema, rental facts & SCDs | STR grain, slowly changing listing or host attributes. |
SQL evaluation order (mental model): FROM / joins → WHERE (row filter) → GROUP BY → aggregates → HAVING (group filter) → windows → SELECT / ORDER BY. If you are filtering a sum or count, that condition almost always belongs in `HAVING`, not `WHERE`.
1. Aggregation and GROUP BY in marketplace SQL
Aggregation and GROUP BY in SQL for data engineering
Picture a table with many detail rows—often one row per booking or per night line item. Aggregation means: turn lots of rows into one summary value (or a few) per bucket. The bucket is whatever the prompt names: per market, per listing, per origin–destination pair, or per (listing, week). GROUP BY names that bucket: every distinct combination of the grouping keys is one group; the engine runs SUM, COUNT, AVG inside each group separately.
Report grain (say this in the interview): After the query, ask, “One output row is what, in business terms?” If the question asked “gross booking dollars per market,” the grain is market. If you GROUP BY at listing by mistake, you are answering a different question.
SUM, AVG, and NULL
- `SUM(col)` adds the non-null numerics in the group. `NULL` is skipped (not coerced to 0). If every value in the group is `NULL`, `SUM` is usually `NULL`, not 0.
- `AVG(col)` is the sum of non-null values ÷ the count of non-null values in that group.
Worked example: In one market_id group, amount_usd values 200, NULL, 300 → SUM(amount_usd) = 500, AVG(amount_usd) = 250 (only two values averaged).
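The worked example above can be checked directly with SQLite from Python; the mini table is hypothetical, built just to mirror the 200 / NULL / 300 group:

```python
import sqlite3

# Hypothetical mini table mirroring the worked example: 200, NULL, 300 in one market.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE night_stays (market_id TEXT, amount_usd REAL)")
con.executemany(
    "INSERT INTO night_stays VALUES (?, ?)",
    [("M1", 200), ("M1", None), ("M1", 300)],
)
total, avg = con.execute(
    "SELECT SUM(amount_usd), AVG(amount_usd) FROM night_stays GROUP BY market_id"
).fetchone()
print(total, avg)  # 500.0 250.0 (NULL skipped by both aggregates)
```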
COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)
- `COUNT(*)` counts rows in the group, including rows where some column is `NULL`.
- `COUNT(col)` counts rows where `col` is not `NULL`.
- `COUNT(DISTINCT col)` counts unique non-null values, which is required after a join that multiplies a guest or listing across many rows, when the question is "how many guests," not "how many booking rows."
Worked example: After a one-to-many join, two rows for the same guest_id → COUNT(*) = 2, COUNT(DISTINCT guest_id) = 1.
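A quick SQLite check of that example, with a hypothetical two-row post-join table:

```python
import sqlite3

# Hypothetical post-join rows: the same guest_id appears on two booking rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE joined (booking_id TEXT, guest_id TEXT)")
con.executemany("INSERT INTO joined VALUES (?, ?)", [("b1", "g1"), ("b2", "g1")])
rows, guests = con.execute(
    "SELECT COUNT(*), COUNT(DISTINCT guest_id) FROM joined"
).fetchone()
print(rows, guests)  # 2 1
```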
MIN / MAX
Use for prices, dates, ranks. NULL is skipped (same as in SUM semantics). In marketplace data, MAX(created_at) is often “latest booking in this group” before you switch to a window in §3.
WHERE vs HAVING
- `WHERE` filters input rows before `GROUP BY`: predicates on raw columns of the table as seen before grouping.
- `HAVING` filters after `GROUP BY` on aggregates, e.g. `HAVING SUM(amount) > 1000` or `HAVING COUNT(*) >= 2`.
Worked example — per-market subtotals, two different filters:
| booking_id | market_id | subtotal | is_cancelled |
|---|---|---|---|
| b1 | M1 | 80 | false |
| b2 | M1 | 30 | true |
| b3 | M2 | 200 | false |
- `WHERE NOT is_cancelled`, then `GROUP BY market_id, SUM(subtotal)` → M1 = 80, M2 = 200 (the cancelled row is dropped first).
- Same table without the `WHERE`, then `HAVING SUM(subtotal) > 50` (sum of all subtotals in the group) → M1 = 110, M2 = 200, so both markets pass, and M1's sum differs from the first bullet. Be explicit about which population the question means (all rows vs. completed only).
Rule of thumb: If the filter uses an aggregate of the group (SUM, COUNT, AVG), it belongs in HAVING, not WHERE. If it only uses one row at a time (e.g. status = 'confirmed'), use WHERE.
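Running both filters on the worked-example table makes the difference concrete; the table and values below are the ones from the example:

```python
import sqlite3

# The per-market subtotal table from the worked example above.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE bookings (booking_id TEXT, market_id TEXT, subtotal REAL, is_cancelled INTEGER)"
)
con.executemany(
    "INSERT INTO bookings VALUES (?, ?, ?, ?)",
    [("b1", "M1", 80, 0), ("b2", "M1", 30, 1), ("b3", "M2", 200, 0)],
)
# WHERE: drop cancelled rows before grouping.
where_first = dict(con.execute(
    "SELECT market_id, SUM(subtotal) FROM bookings WHERE NOT is_cancelled GROUP BY market_id"
))
# HAVING: group every row, then filter on the aggregate.
having_only = dict(con.execute(
    "SELECT market_id, SUM(subtotal) FROM bookings GROUP BY market_id HAVING SUM(subtotal) > 50"
))
print(where_first)  # {'M1': 80.0, 'M2': 200.0}
print(having_only)  # {'M1': 110.0, 'M2': 200.0}
```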
Half-open date ranges and revenue by period
Worked example: Nightly amount for calendar year 2024 per market_id:
```sql
SELECT market_id, SUM(amount_usd) AS rev
FROM night_stays
WHERE stay_date >= DATE '2024-01-01' AND stay_date < DATE '2025-01-01'
GROUP BY market_id;
```
Takeaway: The interval [2024-01-01, 2025-01-01) (half-open) avoids double counting a boundary if you also store timestamps at midnight. Say timezone if the table is TIMESTAMPTZ and the business rule is not UTC.
Conditional aggregation: SUM(CASE …) and net vs gross
Idea: Sum or count only the rows that meet a business rule, without a separate WHERE that would drop the rest of the group from the query entirely. Use this when you need both "gross" and "net" in one SELECT, or when partial rows must stay in the denominator of a rate.
Worked example — net booking dollars after cancellations (same shape as before, now with the full pattern):
| booking_id | market_id | amount_usd | is_cancelled |
|---|---|---|---|
| b1 | M1 | 200 | false |
| b2 | M1 | 100 | true |
SUM(CASE WHEN NOT is_cancelled THEN amount_usd ELSE 0 END) = 200; gross would be SUM(amount_usd) = 300 if the prompt asked for both columns.
Common beginner mistakes
- Putting a condition on `SUM(x)` in `WHERE` instead of `HAVING`.
- `COUNT(*)` when the business asked for unique guests or listings: use `COUNT(DISTINCT …)` at the right grain.
- Joining one-to-many before `GROUP BY` and summing a dimension field that repeats, a double count: fix at the right grain or dedupe first (see §2).
SQL interview question on net booking volume by market
Table: bookings(booking_id, market_id, amount_usd, is_cancelled). Return each market_id and net booking volume (sum of amount_usd where NOT is_cancelled) for markets where that net sum is strictly positive, ordered by net volume descending.
Solution using GROUP BY and conditional SUM
```sql
SELECT market_id,
       SUM(CASE WHEN NOT is_cancelled THEN amount_usd ELSE 0 END) AS net_vol
FROM bookings
GROUP BY market_id
HAVING SUM(CASE WHEN NOT is_cancelled THEN amount_usd ELSE 0 END) > 0
ORDER BY net_vol DESC;
```
Why this works: Cancellations contribute 0 in the CASE; HAVING filters per-market aggregates, not per-row. Grain: one row per market_id.
SQL interview question on distinct active guests by market (post-join)
Table: bookings(booking_id, market_id, guest_id, is_cancelled) (one row per booking). Return each market_id and COUNT(DISTINCT guest_id) among non-cancelled bookings only, ordered by that count descending.
Solution using COUNT(DISTINCT …) and WHERE
```sql
SELECT market_id, COUNT(DISTINCT guest_id) AS active_guests
FROM bookings
WHERE NOT is_cancelled
GROUP BY market_id
ORDER BY active_guests DESC;
```
Why this works: WHERE restricts to completed rows before grouping; COUNT(DISTINCT guest_id) counts people, not booking rows, at market grain. If a guest has two completed bookings in the same market, they still count once.
Practice
- SQL · Topic — Aggregation problems (all companies)
- COMPANY · Airbnb — Airbnb-tagged aggregation
When you want SQL by difficulty on the same hub, open Airbnb SQL (all difficulties) or Airbnb easy SQL to warm up.
2. Joins, anti-joins, and deduplication in SQL
Joins and deduplication in SQL for data engineering
A join answers: “For each row in A, which row(s) in B match the ON condition?” In marketplace analytics, you join facts (bookings, page views, impressions) to dimensions (listing, market, user profile) to get attributes for reporting. You are evaluated on: inner vs outer use, anti-join for “has no …”, and fan-out when a key is not unique on one side.
Fact vs dimension and grain (how interviewers want you to talk about joins):
- Fact: many rows, often one row per event or booking; the thing you `SUM` or `COUNT`.
- Dimension: one row per entity (in an ideal, clean model): listing, host, city, enriched via the join. If a dimension is not unique (duplicate `listing_id` rows), a join to facts can multiply rows; summing a dimension's attribute then inflates the metric.
Inner join
Result: Only rows that satisfy the ON on both sides. Unmatched keys disappear.
Worked example — bookings inner join listings on listing_id: if a booking’s listing_id was deleted in source and is missing in listings, that booking vanishes from the result. If the question is “revenue for known listings,” that may be right; if it is “all revenue, including orphan bookings,” you need a left or fuzzy key strategy—state your assumption out loud.
Left join and the anti-join pattern
LEFT JOIN returns all left-side rows, padding right columns with NULL when there is no match. Anti-join: LEFT JOIN … WHERE right.not_null_col IS NULL keeps only left rows with no partner in B. Classic: “Guests who have never completed a booking.”
Worked example (mini): guests ids G1, G2; bookings has G1 only. FROM guests g LEFT JOIN bookings b ON g.id = b.guest_id WHERE b.booking_id IS NULL returns G2 only.
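The mini example runs as written in SQLite; the two tiny tables are hypothetical:

```python
import sqlite3

# Mini anti-join from the worked example: G1 has a booking, G2 does not.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE guests (id TEXT)")
con.execute("CREATE TABLE bookings (booking_id TEXT, guest_id TEXT)")
con.executemany("INSERT INTO guests VALUES (?)", [("G1",), ("G2",)])
con.execute("INSERT INTO bookings VALUES ('b1', 'G1')")
no_booking = [r[0] for r in con.execute(
    """
    SELECT g.id
    FROM guests g
    LEFT JOIN bookings b ON g.id = b.guest_id
    WHERE b.booking_id IS NULL
    """
)]
print(no_booking)  # ['G2']
```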
Semi-join: EXISTS / IN (keep left rows, do not add B’s columns)
“Listings that had at least one booking in NYC” is often a filter on listings, not a full merge. WHERE EXISTS (SELECT 1 FROM bookings … WHERE … listing_id = listings.id) keeps one row per listing; compare to joining all bookings, then DISTINCT, which is heavier and easier to get wrong on grain.
Fan-out: when one listing matches many booking rows
Bug pattern: SUM(dimension.accommodates) after bookings ⋈ listings with one row per booking will add the same accommodates once per booking, which is not "total capacity in the market" but "sum of capacity over booking rows." Fixes: SUM at listing grain in a subquery, MAX/MIN of the attribute per listing, or a COUNT(DISTINCT listing_id)-style question instead of a blind sum.
Worked example:
Dimension table:

| guest_id (dim) | credit_score |
|---|---|
| 1 | 720 |

Fact table:

| booking_id | guest_id |
|---|---|
| b1 | 1 |
| b2 | 1 |

The JOIN gives two rows; SUM(credit_score) = 1440 if you are not careful, which is wrong for "system-wide average credit of guests with bookings." Use AVG per guest in a CTE, or MAX(credit_score) if one value per guest_id is intended in the result.
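The inflation and one of the listed fixes (collapse to one row per guest before aggregating) can be shown side by side; tables are the hypothetical ones above:

```python
import sqlite3

# Fan-out demo: one dimension row joined to two fact rows doubles the attribute sum.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE guest_dim (guest_id INTEGER, credit_score INTEGER)")
con.execute("CREATE TABLE bookings (booking_id TEXT, guest_id INTEGER)")
con.execute("INSERT INTO guest_dim VALUES (1, 720)")
con.executemany("INSERT INTO bookings VALUES (?, ?)", [("b1", 1), ("b2", 1)])
(naive,) = con.execute(
    "SELECT SUM(d.credit_score) FROM bookings b JOIN guest_dim d ON b.guest_id = d.guest_id"
).fetchone()
# Fix: collapse to one row per guest before aggregating.
(fixed,) = con.execute(
    """
    SELECT SUM(credit_score) FROM (
      SELECT d.guest_id, MAX(d.credit_score) AS credit_score
      FROM bookings b JOIN guest_dim d ON b.guest_id = d.guest_id
      GROUP BY d.guest_id
    )
    """
).fetchone()
print(naive, fixed)  # 1440 720
```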
Deduplication: ROW_NUMBER vs DISTINCT ON
“Latest base_price per listing” in a type-2-ish history table: prefer ROW_NUMBER() OVER (PARTITION BY listing_id ORDER BY as_of_date DESC) AS rn then WHERE rn = 1 in an outer filter so you control ties. DISTINCT ON (listing_id) … ORDER BY … (PostgreSQL) is compact; tie semantics must match the business rule (e.g. break ties on ingest_id).
Worked example (pattern only):
```sql
SELECT listing_id, base_price, as_of_date
FROM (
  SELECT listing_id, base_price, as_of_date,
         ROW_NUMBER() OVER (
           PARTITION BY listing_id ORDER BY as_of_date DESC, ingest_id DESC
         ) AS rn
  FROM listing_price_history
) t
WHERE rn = 1;
```
Common beginner mistakes
- Joining on `NULL = NULL`: not true; the rows do not match. Use `COALESCE` or explicit `IS NULL` logic only if the spec allows it.
- Inner join when the question needs the unmatched side kept (left / outer join).
- Aggregating at a join-inflated grain without a pre-aggregate or distinct step.
Joins-based interview question on O–D pair volume
Tables: bookings(booking_id, origin_city, dest_city, amount_usd, is_cancelled). Return each ordered pair (origin_city, dest_city) with count of non-cancelled bookings, excluding pairs where count < 2. Order by count descending.
Solution using GROUP BY and HAVING
```sql
SELECT origin_city, dest_city, COUNT(*) AS n
FROM bookings
WHERE NOT is_cancelled
GROUP BY origin_city, dest_city
HAVING COUNT(*) >= 2
ORDER BY n DESC;
```
Why this works: Grain = (origin, dest); WHERE filters rows; HAVING enforces the "at least two completed bookings" rule on the group.
Joins-based interview question on guests with no completed booking
Tables: guests(guest_id, email), bookings(booking_id, guest_id, status) with status in ('pending','confirmed','cancelled'). Return guest_id and email for guests with no row where status = 'confirmed'.
Solution using LEFT JOIN and NOT EXISTS (two valid patterns)
Option A — anti-join:
```sql
SELECT g.guest_id, g.email
FROM guests g
LEFT JOIN bookings b
  ON b.guest_id = g.guest_id AND b.status = 'confirmed'
WHERE b.booking_id IS NULL;
```
Option B — NOT EXISTS:
```sql
SELECT g.guest_id, g.email
FROM guests g
WHERE NOT EXISTS (
  SELECT 1
  FROM bookings b
  WHERE b.guest_id = g.guest_id AND b.status = 'confirmed'
);
```
Why this works: Option A: an unmatched LEFT JOIN row pads booking_id with NULL, and we keep those rows. Option B: set membership, often clearer when bookings needs multiple conditions. Both are standard anti-join patterns interviewers like you to name.
Practice
- SQL · Topic — Joins & deduplication
- COMPANY · Airbnb — Airbnb-tagged joins
Airbnb-tagged join drills (easy) are a good next step.
3. Window functions and ranking in SQL
Window functions and ranking in SQL for travel and marketplace data
GROUP BY collapses many rows to one per group. Window functions keep every input row and attach a value computed over a window of rows—usually “all rows in the same partition that sort before or with this row.” In travel and listing data, that is how you express first booking to a city, top-N hosts by neighborhood, running revenue, or gaps between events.
Parts of a window: PARTITION BY (optional) splits the table into independent groups—e.g. per guest_id or per (market_id, listing_id). ORDER BY inside OVER (…) defines order within that partition. ROWS BETWEEN … (optional) bounds the frame for running sums (not always needed in interviews if the default frame is correct).
ROW_NUMBER, RANK, and DENSE_RANK
All need ORDER BY in OVER (for meaningful output).
- `ROW_NUMBER()`: unique integers 1..n in the partition after the sort; no ties. If two rows tie on the sort key, the engine picks an arbitrary order unless you add tie-break columns (e.g. `guest_id ASC`).
- `RANK()`: tied rows get the same rank; the next rank skips (1, 2, 2, 4).
- `DENSE_RANK()`: tied rows get the same rank; no gaps (1, 2, 2, 3).
Worked example — three trips in one market_id by amt_usd desc:
| guest_id | amt_usd |
|---|---|
| 10 | 500 |
| 20 | 400 |
| 30 | 400 |
ROW_NUMBER might be 1, 2, 3 (with tie-break on guest_id). RANK is 1, 2, 2; DENSE_RANK is 1, 2, 2. Use ROW_NUMBER when the problem says “pick one winner per market” with a clear tie-break; use DENSE_RANK when you want “top 3 price tiers” (ties share a tier).
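The three-trip example evaluates as described in SQLite (window functions need SQLite 3.25+, bundled with modern Python builds):

```python
import sqlite3

# Ranking demo on the three-trip worked example above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (guest_id INTEGER, amt_usd INTEGER)")
con.executemany("INSERT INTO trips VALUES (?, ?)", [(10, 500), (20, 400), (30, 400)])
rows = con.execute(
    """
    SELECT guest_id,
           ROW_NUMBER() OVER (ORDER BY amt_usd DESC, guest_id ASC) AS row_num,
           RANK()       OVER (ORDER BY amt_usd DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY amt_usd DESC) AS drnk
    FROM trips
    ORDER BY amt_usd DESC, guest_id ASC
    """
).fetchall()
print(rows)  # [(10, 1, 1, 1), (20, 2, 2, 2), (30, 3, 2, 2)]
```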
LAG and LEAD for time gaps and prior values
- `LAG(col, 1)` = the previous row's `col` in partition order; the first row in a partition gets `NULL`.
- `LEAD(col, 1)` = the next row's `col` (useful for "time to next booking").
Worked example (dates): LAG(order_date) per guest_id in order → gap in days = order_date - LAG (for DATE type in PostgreSQL-style SQL, difference is an integer day count; for timestamps, use EPOCH difference / 60 for minutes in session rules).
Running totals (optional pattern)
Cumulative SUM(amount) OVER (PARTITION BY market_id ORDER BY booking_ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) = the running total per market in time order; the same idea as "year-to-date" metrics with a clear ORDER BY.
Common beginner mistakes
- Forgetting `ORDER BY` in `OVER` for `ROW_NUMBER` / `RANK` / `LAG`, creating nondeterministic or meaningless output.
- Using `RANK` = 1 when two rows tie and the prompt still wants a unique "winner": add a tie-break to `ORDER BY` and use `ROW_NUMBER`, or get clarification.
SQL interview question on top spender per market
Table: spend(guest_id, market_id, amt_usd), assumed one row per guest-market (if multiple rows per guest-market are possible, clarify with the interviewer). Return the guest_id with the highest amt_usd in each market_id (ties on amount: pick the lowest guest_id).
Solution using ROW_NUMBER
```sql
SELECT guest_id, market_id, amt_usd
FROM (
  SELECT guest_id, market_id, amt_usd,
         ROW_NUMBER() OVER (
           PARTITION BY market_id ORDER BY amt_usd DESC, guest_id ASC
         ) AS rn
  FROM spend
) t
WHERE rn = 1;
```
Why this works: rn = 1 is exactly one row per market_id; guest_id ASC breaks ties on amt_usd as specified.
SQL interview question on guests whose second booking is to a given city (pattern with LAG)
Table: bookings(guest_id, booking_seq, dest_city, booking_ts) with booking_seq 1, 2, 3, … in true chronological order per guest (simplified). Return guest_id for guests whose second booking has dest_city = 'NYC'.
Solution using LAG to compare sequence
If booking_seq is 1-based and consecutive, a simpler form is: WHERE booking_seq = 2 AND dest_city = 'NYC'. The window version (when you only have booking_ts, not a sequence number):
```sql
SELECT guest_id
FROM (
  SELECT guest_id, dest_city,
         ROW_NUMBER() OVER (PARTITION BY guest_id ORDER BY booking_ts) AS rn
  FROM bookings
) t
WHERE rn = 2 AND dest_city = 'NYC';
```
Why this works: ROW_NUMBER over ORDER BY booking_ts labels the 1st, 2nd, 3rd trip per guest; we filter the 2nd row. If the question instead needs LAG, LAG(dest_city) OVER (PARTITION BY guest_id ORDER BY booking_ts) on the third event row gives the second destination only if the grain is one row per leg; match the ERD to the business rule.
Practice
- SQL · Topic — Window functions (SQL)
- COMPANY · Airbnb — Airbnb-tagged window functions
4. Sessionization and time gaps in event SQL
Event gaps, sessions, and half-open time ranges in SQL
Sessionization questions ask you to group events into sessions when silence exceeds a threshold—classic for web, app, and search logs. A common rule: start a new session if the time since the previous event in the same partition (e.g. user_id) is strictly more than 30 minutes (confirm N and strict vs ≥ in the spec).
Three-step set-based pattern:
1. `LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts)` to align each event with its previous one.
2. `is_start`: 1 if there is no previous event, else 1 if the gap (in seconds or minutes) exceeds the threshold, else 0.
3. `session_id`: `SUM(is_start) OVER (PARTITION BY user_id ORDER BY event_ts)`, a running count of session starts that labels each row with a monotonic session id within the user (1, 1, 1, 2, 2, … in time order).
Half-open reporting windows on dates/timestamps: use [start, end) (see §1) so you do not double count events at a midnight boundary in mixed DATE/TIMESTAMP stores.
Worked example — gap flags and cumulative id:
| user_id | ts | (gap from prev) | is_start | session_id (cumulative) |
|---|---|---|---|---|
| u1 | 10:00 | — | 1 | 1 |
| u1 | 10:10 | 10m | 0 | 1 |
| u1 | 11:00 | 50m | 1 | 2 |
(With 30-minute inactivity to start a new session.)
LAG and a gap predicate
After prev_ts is in the row, “is new session?” is prev_ts IS NULL OR (event_ts - prev_ts) > interval '30 minutes' (dialect-specific).
Full session_id sketch (PostgreSQL-style; illustrative)
Worked example:
```sql
WITH t AS (
  SELECT user_id, event_id, event_ts,
         LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts) AS prev_ts
  FROM events
),
f AS (
  SELECT *,
         CASE
           WHEN prev_ts IS NULL THEN 1
           WHEN event_ts - prev_ts > INTERVAL '30 minutes' THEN 1
           ELSE 0
         END AS is_start
  FROM t
)
SELECT user_id, event_id, event_ts,
       SUM(is_start) OVER (PARTITION BY user_id ORDER BY event_ts) AS session_id
FROM f;
```
Why this works: is_start fires at each new session boundary; the prefix sum in time order is the session counter for that user_id.
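The same three-step logic can be rendered in pure Python, a sketch assuming events arrive as (user_id, datetime) tuples; the `sessionize` helper name and shapes are made up for illustration:

```python
from datetime import datetime, timedelta

def sessionize(events, gap=timedelta(minutes=30)):
    """Label (user_id, ts) events with per-user session ids.

    A pure-Python version of the LAG + is_start + prefix-sum pattern.
    """
    out, prev, sid = [], {}, {}
    for user, ts in sorted(events, key=lambda e: (e[0], e[1])):
        # New session if there is no previous event or silence exceeds the gap.
        if user not in prev or ts - prev[user] > gap:
            sid[user] = sid.get(user, 0) + 1
        prev[user] = ts
        out.append((user, ts, sid[user]))
    return out

events = [
    ("u1", datetime(2024, 6, 1, 10, 0)),
    ("u1", datetime(2024, 6, 1, 10, 10)),
    ("u1", datetime(2024, 6, 1, 11, 0)),
]
print([s for _, _, s in sessionize(events)])  # [1, 1, 2]
```

This reproduces the gap-flag table above: the 10-minute gap stays in session 1, the 50-minute gap opens session 2.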
Common beginner mistakes
- Time zone drift between ingest and display—fix a convention in the answer.
- Wrong `PARTITION BY` (e.g. a session spanning users).
- Reusing a global `session_id` without resetting per day when the spec says "sessions that day": filter events to the day before you count distinct session labels in that range.
SQL interview question on orders with gap > 7 days since previous (pattern)
Table: orders(order_id, guest_id, order_date) (DATE). Return order_id where the day gap from this guest’s previous order_date (by order_date ascending) is > 7; drop a guest’s first order (no predecessor).
Solution using LAG
```sql
SELECT order_id, guest_id, order_date, prev_d,
       order_date - prev_d AS gap_days
FROM (
  SELECT order_id, guest_id, order_date,
         LAG(order_date) OVER (
           PARTITION BY guest_id ORDER BY order_date, order_id
         ) AS prev_d
  FROM orders
) t
WHERE prev_d IS NOT NULL
  AND order_date - prev_d > 7;
```
Why this works: LAG + prev_d IS NOT NULL (first order removed); for app-event gaps in minutes, switch to TIMESTAMP and EPOCH difference.
SQL interview question: sessions per user on one calendar day (outline + solution)
Table: events(user_id, event_ts); 30-minute rule. Count of distinct sessions per user_id for event_ts on 2024-06-01 (UTC; state the assumption). (Use the CTE pattern above, then WHERE event_ts >= … AND event_ts < next day, GROUP BY user_id, COUNT(DISTINCT session_id) or equivalent.)
Solution (pipeline outline)
1. Compute `session_id` for all `events` (full sketch above).
2. `WHERE event_ts >= TIMESTAMP '2024-06-01 00:00:00' AND event_ts < TIMESTAMP '2024-06-02 00:00:00'`.
3. `GROUP BY user_id` with `COUNT(DISTINCT session_id)`: since `session_id` is computed per user (the `PARTITION BY user_id`), distinct ids over the filtered rows count the sessions touching that day. `MAX(session_id) - MIN(session_id) + 1` is not always valid because it assumes consecutive ids inside the window. (Interview credit is for the `LAG` + `is_start` + prefix-sum pattern, not a perfect distinct hack.)
Why this works: The count of logical session segments in the time window is what you need; match your id construction to a per-user or per-day reset if the spec requires it.
Practice
- SQL · Topic — Time-series SQL
- COMPANY · Airbnb — all Airbnb problems (hub)
Medium-difficulty SQL on Airbnb often mixes windows with time logic.
5. Subqueries, CASE, and market-style rollups in SQL
CASE, filtered aggregates, and subqueries in SQL
Market or “pulse” reports usually slice a fact table (bookings, listings) by market_id, category, or date, and ask for multiple metrics in one query—conditional counts/sums are the workhorse. You express “only confirmed” or “nights at least 7” either with WHERE (if the whole row is in/out of the population) or with SUM(CASE …) / COUNT(*) FILTER (WHERE …) when the same SELECT must output both all-row counts and restricted metrics.
Searched CASE in aggregates
Idea: Build a 0/1 or NULL flag inside the aggregate. COUNT(*) still counts rows; SUM(CASE WHEN condition THEN 1 END) counts rows where the condition is true and SUM skips NULL.
Worked example (logical): In one GROUP BY market_id, you can output: total rows, SUM(CASE WHEN status = 'confirmed' THEN 1 END), and SUM(CASE WHEN status = 'confirmed' AND nights >= 7 THEN 1 END) in one pass.
FILTER (PostgreSQL) vs CASE
COUNT(*) FILTER (WHERE status = 'confirmed') is often clearer than nested SUM(CASE…); both compile to the same class of plan in many optimizers. If the engine has no FILTER, translate mechanically to SUM(CASE WHEN … THEN 1 ELSE 0 END).
Ratios, percentages, and NULLIF for safe division
Worked example (pattern):
SUM(CASE WHEN is_click THEN 1.0 ELSE 0 END) / NULLIF(SUM(CASE WHEN is_impression THEN 1.0 ELSE 0 END), 0) gives CTR at (campaign, user) grain. Dedupe impressions/clicks to one row per pair first if the spec says so, or the denominator will not match the business definition (see §2 on fan-out).
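A minimal SQLite check of the safe-division pattern, on a hypothetical four-row event table (one of the rows is neither click nor impression):

```python
import sqlite3

# NULLIF guards the denominator: zero impressions yields NULL (None), not an error.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (is_click INTEGER, is_impression INTEGER)")
con.executemany("INSERT INTO events VALUES (?, ?)", [(0, 1), (1, 1), (0, 1), (0, 0)])
(ctr,) = con.execute(
    """
    SELECT SUM(CASE WHEN is_click THEN 1.0 ELSE 0 END)
           / NULLIF(SUM(CASE WHEN is_impression THEN 1.0 ELSE 0 END), 0)
    FROM events
    """
).fetchone()
print(ctr)  # 1 click / 3 impressions = 0.333...
```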
CTEs to fix grain before a join
Idea: “Markets with > 5 active listings” may need FROM (SELECT market_id, COUNT(*) AS n FROM listings WHERE is_active GROUP BY market_id) t WHERE n > 5, or a CTE doing the same. CTEs are for readability; the key is to aggregate in an inner query and filter in an outer query when the predicate is on the aggregate, not on a row flag.
Common beginner mistakes
- Integer division in `a/b` when both are integers: use `* 1.0`, `CAST(… AS FLOAT)`, or `NUMERIC(12,4)`.
- CTR at impression grain vs (user, campaign) grain: match the dedupe rule to the denominator definition.
- `WHERE` on `SUM(…)` at the same level as the raw `FROM` is illegal in standard SQL; use `HAVING` or a subquery/CTE.
SQL interview question on tiered market counts
Table: bookings(listing_id, market_id, nights, status); status in ('pending','confirmed','cancelled'). Return each market_id, the count of confirmed bookings, and the count of confirmed bookings with nights >= 7.
Solution using conditional counts
```sql
SELECT market_id,
       COUNT(*) FILTER (WHERE status = 'confirmed') AS n_conf,
       COUNT(*) FILTER (WHERE status = 'confirmed' AND nights >= 7) AS n_long_stay
FROM bookings
GROUP BY market_id;
```
Why this works: One pass per market_id; FILTER is the readable form of two independent filters on the confirmed population for the long-stay column.
SQL interview question: markets with high share of long-stay (rate + HAVING)
Table: same as above. Return market_id where among confirmed bookings, at least 20% have nights >= 7, and there are at least 10 confirmed bookings in that market.
Solution using HAVING on a ratio
```sql
SELECT market_id
FROM bookings
WHERE status = 'confirmed'
GROUP BY market_id
HAVING COUNT(*) >= 10
   AND SUM(CASE WHEN nights >= 7 THEN 1 ELSE 0 END) * 1.0 / NULLIF(COUNT(*), 0) >= 0.2;
```
Why this works: Confirmed-only base comes from WHERE status = 'confirmed'. The numerator is long-stay count; denominator is all confirmed rows in the group; HAVING enforces both min volume and min share. Use 1.0 to avoid integer division in engines that are picky.
Practice
- SQL · Topic — Conditional logic
- SQL · Topic — Subqueries
6. Python: arrays, intervals, graphs, and pandas for DE screens
Data structures, sweep lines, and tabular code in Python for data engineering
Python screens for data engineering (including marketplace-style Airbnb sets) are usually O(n log n) or O(n): you hash keys, sort for intervals, or BFS/DFS on a small graph. You are not expected to re-prove theorems, but to name complexity, avoid nested scans over the whole list per element, and match the grain of pandas merges to join rules in SQL.
set, dict, and defaultdict for grouping and membership
- `set`: O(1) average `in` and `add`; first duplicate, "seen," uniques.
- `dict` / `defaultdict`: aggregate in one pass by key, the same idea as `GROUP BY` in SQL. `defaultdict(int)` avoids `KeyError` on `+=`.
Worked example — frequency: ['a','b','a'] → Counter or dict gives a:2, b:1.
Two pointers (sorted arrays) and one-pass first duplicate
Two pointers often appear in merged or intersection problems after sorting. First duplicate in read order is not two-pointer; it is one left-to-right with a set.
Worked example — merge two sorted unique lists (sketch): i, j from 0, advance the smaller, append to out—O(n+m) time, O(1) extra if output can be in place.
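The merge sketch above, written out; `merge_sorted` is an illustrative helper name:

```python
def merge_sorted(a, b):
    """Merge two sorted lists with two pointers in O(n + m)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        # Advance whichever pointer holds the smaller value.
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # one side may have leftovers
    out.extend(b[j:])
    return out

print(merge_sorted([1, 3, 5], [2, 3, 6]))  # [1, 2, 3, 3, 5, 6]
```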
Intervals, sorting by end, and “non-overlapping” / sweep ideas
Many “booking” or reservation problems are intervals [start, end). Sorted by end time, a greedy that takes an interval if its start is after the last end is the classic “max number of non-overlapping” pattern (or min rooms with a min-heap of end times for concurrent count—know which variant the prompt is asking for).
Worked example (intervals, logical): [(1,3),(2,4),(3,5)] — if inclusive overlap at a point, 3 are concurrent at t=3; a sweep counts +1 on start events, -1 on end; max of prefix is concurrent count.
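A sweep-line sketch of the "concurrent count" reading of that example, treating endpoints as inclusive so the meeting point at t=3 counts as overlap:

```python
def max_concurrent(intervals):
    """Sweep line: +1 at each start, -1 at each end; max prefix = peak concurrency."""
    events = []
    for s, e in intervals:
        events.append((s, 1))
        events.append((e, -1))
    # Starts before ends at the same time -> inclusive endpoints overlap.
    events.sort(key=lambda ev: (ev[0], -ev[1]))
    cur = best = 0
    for _, delta in events:
        cur += delta
        best = max(best, cur)
    return best

print(max_concurrent([(1, 3), (2, 4), (3, 5)]))  # 3
```

Flip the sort (ends before starts) to get half-open [start, end) semantics instead.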
Trees and graphs (inheritance, category trees)
Parent pointer to root or adj list for BFS/DFS from a source; O(V+E). Inheritance of listing attributes in a tree is often: walk up to root until a set value is found (with path or cache in harder variants).
Worked example (verbal): parent[child] = p in a dict; walk from leaf to root in O(depth).
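The walk-up inheritance idea can be sketched as follows; `parent` maps child to parent, `values` maps node to a set attribute, and both names are illustrative:

```python
def inherited_value(node, parent, values):
    """Walk parent pointers toward the root until a set value is found.

    Returns None if nothing on the path has the attribute; O(depth) per lookup.
    """
    cur = node
    while cur is not None:
        if cur in values:
            return cur and values[cur]
        cur = parent.get(cur)
    return None

parent = {"room": "apartment", "apartment": "building", "building": None}
values = {"building": "NYC"}  # only the root has a city set
print(inherited_value("room", parent, values))  # 'NYC'
```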
pandas: merge and groupby (and fan-out, again)
pd.merge(orders, dim, on='listing_id', how='left') then g = merged.groupby('market_id')['amount'].sum() is “SQL join + GROUP BY” in memory. If dim has duplicate listing_id rows, sum of a dimension column can double-count (same fan-out story as §2).
Worked example (conceptual pandas): After a one-to-one listing dimension, merge(...).groupby('market_id', as_index=False)['gross_nights'].sum() matches per-market grain.
Common beginner mistakes
- O(n²) from a `for` loop running `x in list` membership tests when the list is large: precompute a `set` of allowed ids.
- Off-by-one on inclusive vs half-open interval ends in overlap checks.
- `pandas` merge on non-unique right keys, then summing a duplicated column from the right.
Python interview question on first duplicate in order ids
Input: ids: list[int]. Return the first value that appears twice in left-to-right order, or None if there is no duplicate.
Solution using a set
```python
def first_dup(ids: list[int]) -> int | None:
    seen: set[int] = set()
    for x in ids:
        if x in seen:
            return x
        seen.add(x)
    return None
```
Why this works: O(n) time, O(n) space; the first time you see a value for the second time, return it.
Python interview question on merge quantities by key
Input: two lists of dicts a and b, each with keys sku, qty. Return one dict sku -> total qty.
Solution using defaultdict
```python
from collections import defaultdict

def merge(a: list[dict], b: list[dict]) -> dict[str, int]:
    t: dict[str, int] = defaultdict(int)
    for row in a + b:
        t[row["sku"]] += row["qty"]
    return dict(t)
```
Why this works: One pass; same pattern as GROUP BY sku, SUM(qty).
Python interview question on maximum number of non-overlapping stay intervals (greedy by end time)
Input: intervals: list[tuple[int, int]] as (start, end) on a discrete day line; treat non-overlapping as end < next start (equivalently half-open [start, end) in continuous time). Return the maximum number of pairwise non-overlapping intervals (classic greedy: sort by end, then take the next interval whose start is strictly after the last chosen end).
Solution: sort by end, greedy count
```python
def max_non_overlapping(intervals: list[tuple[int, int]]) -> int:
    if not intervals:
        return 0
    intervals = sorted(intervals, key=lambda x: x[1])
    last_end = float("-inf")
    n = 0
    for s, e in intervals:
        if s > last_end:
            n += 1
            last_end = e
    return n
```
Why this works: Earliest-end-first greedy is optimal for max count of non-overlapping intervals on a line when no overlap means start > previous end (adjust to >= if the spec shares boundary days). Confirm inclusive vs exclusive end with the interviewer.
Practice
- PYTHON · Topic — Array (Python)
- COMPANY · Airbnb — Airbnb-tagged array
For graph-indexed practice on the same hub, use Airbnb-tagged graph. The full language filter is Airbnb Python; Python medium on Airbnb matches many longer coding prompts.
General (non–company) drills: Sweep line and graph topic pages—use if you are weak on the shapes that show up in harder prompts.
7. Star schema, rental facts, and SCDs in data engineering
Dimensional modeling and SCDs for warehouse-style questions
Star schema is the default whiteboard structure: a fact table of measures at a clear grain + foreign keys to dimension tables that hold descriptive attributes. For short-term rental (STR) analytics, your fact might be booking-level, night-level, or daily listing-level—state the grain first in any design answer.
Star vs snowflake (one sentence each)
- Star: dimensions are denormalized into wide tables (e.g. city name on `dim_listing`); fewer joins, more redundancy.
- Snowflake: dimensions are normalized (e.g. city in `dim_geo`); more joins, less duplicated text. Tradeoff: query simplicity vs storage and consistency of lookup tables.
Fact grain for STR: booking vs nightly
Worked example (verbal): One row per booking → natural for gross booking value, cancel flags, host payout in one row. One row per (listing, night) in an occupancy calendar → natural for nightly rate × nights, blocked calendar, and revenue when each night is priced. Do not put night-level and booking-level facts in the same table without a bridge or a derived ETL—interviewers will press you on that.
Surrogate keys (listing_sk, guest_sk): The warehouse uses integers that do not change when a source system merges accounts; facts store listing_sk, not raw PMS ids when dimensions are conformed.
SCD Type 1 vs Type 2 for listing attributes
- Type 1 — overwrite: one row per natural listing; fix typos or only care about current city—no history.
- Type 2 — new row for change: a new listing_sk (or a new row in dim_listing with valid_from / valid_to / is_current) when bedrooms or city change; as-of reporting joins the version that was valid on the event date.
Worked example (table) — one natural listing, two SCD2 rows:
| listing_sk | natural_id | city | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| 1001 | L7 | NYC | 2020-01-01 | 2024-12-31 | false |
| 1002 | L7 | MIA | 2025-01-01 | NULL | true |
A 2024 booking should join listing_sk = 1001; a current-listings report uses listing_sk = 1002.
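The as-of lookup against the two SCD2 rows above can be sketched in Python (a real warehouse would do this as a range join in SQL; the in-memory list and linear scan here are simplifications for illustration):

```python
from datetime import date
from typing import Dict, List, Optional

# Mirrors the SCD2 table above; valid_to None plays the role of NULL.
DIM_LISTING = [
    {"listing_sk": 1001, "natural_id": "L7", "city": "NYC",
     "valid_from": date(2020, 1, 1), "valid_to": date(2024, 12, 31)},
    {"listing_sk": 1002, "natural_id": "L7", "city": "MIA",
     "valid_from": date(2025, 1, 1), "valid_to": None},
]

def as_of_sk(natural_id: str, event_date: date,
             dim: List[Dict] = DIM_LISTING) -> Optional[int]:
    """Return the surrogate key whose validity window covers event_date
    (valid_from <= event_date <= valid_to, open-ended when valid_to is None)."""
    for row in dim:
        if (row["natural_id"] == natural_id
                and row["valid_from"] <= event_date
                and (row["valid_to"] is None or event_date <= row["valid_to"])):
            return row["listing_sk"]
    return None

print(as_of_sk("L7", date(2024, 7, 15)))  # 1001 (the NYC version)
print(as_of_sk("L7", date(2025, 3, 1)))   # 1002 (current MIA row)
```

The SQL equivalent is a join on `valid_from <= fact_date AND (valid_to IS NULL OR fact_date <= valid_to)`, which is exactly why a 2024 booking lands on listing_sk 1001.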
Bridge and degenerate dimensions (naming only)
- Bridge if you have many-to-many (e.g. amenity flags → bridge_listing_amenity).
- Degenerate dims in the fact: booking_id on a line-like fact is often stored in the fact itself, not in a separate 1:1 dimension table.
Common beginner mistakes
- Mixing two grains in one fact table without documenting it or deriving both from one canonical fact with clear lineage.
- Using Type 1 for geography when 2023 revenue by the old city is still required.
- Forgetting late-arriving bookings when discussing idempotent loads (verbally).
Data engineering interview question on “current” vs historical listing
Explain how to store listing attributes (e.g. city, accommodates) so a 2024 revenue report for “NYC” is stable when a listing moves to another city in 2025.
Solution using SCD2 on listing dimension (verbal + sketch)
Answer outline: SCD2 rows on dim_listing with valid_from, valid_to (or is_current); fact rows for nightly or booking revenue reference listing_sk active at fact date. NYC 2024 joins to the version valid for each date, not the 2025 row.
Why this works: As-of reporting stays correct; for “current” listing rows, analysts filter on is_current, while historical facts keep their old listing_sk keys.
Data engineering interview question: one fact for nightly revenue vs one for bookings (verbal)
Prompt: We need both (a) revenue by booking for cancellations and (b) revenue by night for calendar analytics. One table or two?
Solution (verbal + sketch)
- Prefer at least two facts: (a) fact_booking, one row per booking, for gross booking value, status, and payout line items; (b) fact_night, one row per (booking, night) or (listing, night), for nightly rate, promo allocation, and night-level experiments. Conformed dim_listing and date dimensions keep rollups consistent; a data mart can pre-join for analysts if needed.
Why this works: Different grains → different facts (or a nightly bridge with very explicit rules). Stating the grain in one clear sentence is often the scoring move.
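A two-fact design like this usually carries a reconciliation check in the load: nightly rows must roll up to the booking-grain total. A minimal Python sketch, with illustrative row shapes and column names (not a real schema):

```python
from collections import defaultdict
from typing import Dict, List

def rollup_nightly(fact_night: List[Dict]) -> Dict[str, float]:
    """Sum nightly revenue per booking_id (the booking grain)."""
    totals: Dict[str, float] = defaultdict(float)
    for row in fact_night:
        totals[row["booking_id"]] += row["nightly_revenue"]
    return dict(totals)

fact_booking = [{"booking_id": "B1", "gross_booking_value": 300.0}]
fact_night = [
    {"booking_id": "B1", "night": "2024-06-01", "nightly_revenue": 100.0},
    {"booking_id": "B1", "night": "2024-06-02", "nightly_revenue": 100.0},
    {"booking_id": "B1", "night": "2024-06-03", "nightly_revenue": 100.0},
]

totals = rollup_nightly(fact_night)
for b in fact_booking:
    # Tolerance matters when nightly allocation involves division.
    assert abs(totals[b["booking_id"]] - b["gross_booking_value"]) < 1e-9
print("grains reconcile")
```

Saying out loud that you would wire this kind of check into the ETL (rather than trusting the two facts to agree) is an easy credibility point in a design round.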
Practice
- SQL · Topic — Dimensional modeling
- SQL · Topic — ETL (topic)
Use the Airbnb hub to pair modeling study with hard-difficulty problems that mention data warehouse or dimensional modeling. PipeCode also offers interview-first courses; see Data modeling for data engineering interviews and ETL system design for data engineering interviews.
Tips to crack Airbnb data engineering interviews
Data engineering interview preparation for a marketplace and travel company like Airbnb means being fluent in SQL grain and windows, in time and session questions, in practical Python (not only CRUD), and in a credible warehouse story on dimensions and SCDs.
- Re-read the 20-problem mix on the Airbnb company page: it skews medium SQL, with graph and data modeling at the hard end of the set—plan reps accordingly.
- For SQL, always say one sentence of grain before you write FROM. Name NULL and join-cardinality assumptions.
- For sessionization and windows, practice LAG/LEAD and ROW_NUMBER on paper until the partition and order are automatic.
- For Python, get comfortable with dict/defaultdict, heap or sort-plus-sweep for intervals, and read-only graph walks; add pandas merges if your loop historically uses them.
- For modeling, memorize one tight example of SCD2 with as-of joins; link it to nightly vs booking facts for STR (short-term rental) metrics.
| Lane | Airbnb-focused path |
|---|---|
| Company hub | /explore/practice/company/airbnb |
| SQL (all) | /explore/practice/company/airbnb/sql |
| Python (all) | /explore/practice/company/airbnb/python |
| By difficulty (easy) | /explore/practice/company/airbnb/difficulty/easy |
Across the platform, 450+ problems with tests and AI feedback are available; use topic and company filters to stay close to the skills above.
Frequently asked questions
What is the Airbnb data engineering interview process?
Rounds often include a recruiter or hiring manager screen, one or more technical sessions (SQL and Python are common), data-centric system or pipeline design, and behavioral interviews. Exact steps vary by level and org—treat the curated Airbnb problem list as pattern prep, not a guarantee of on-site order.
What topics should I study for an Airbnb data engineering interview?
SQL: joins, aggregations, window functions, subqueries, deduplication, and time and session logic. Python: arrays and hashes, intervals (including sweep line), graphs for hierarchical data, and sometimes pandas. Modeling: star schema, facts at clear grain, and SCDs for listing-style change. This guide maps each area to practice links—see the sections and table above.
How difficult are Airbnb data engineering interview questions?
A representative 20-problem company set is often tiered as 6 easy, 11 medium, and 3 hard—medium SQL and join work is the spine; hard items may be data warehouse design or non-trivial graph or DP-style Python depending on the pack. Calibrate with easy vs medium on the same hub.
How should I prepare for an Airbnb data engineering interview?
Solve a mix of company-tagged and global topic problems, verbalize your assumptions, and revisit every miss with a one-paragraph post-mortem (what grain you chose and why). Aim for dozens of reps across SQL and Python before an onsite-style day—not a single cram night.
What programming languages does Airbnb test in data engineering interviews?
SQL and Python are the main language filters in many DE loops; some teams add verbal modeling or system design without a new language. For reps that match, start from Airbnb SQL and Airbnb Python on PipeCode.
Where can I practice Airbnb-style data engineering questions with tests and feedback?
PipeCode offers 450+ problems with automated tests and feedback; the company page for Airbnb is /explore/practice/company/airbnb. You can also browse all company hubs to compare volume and topics across employers.
Start practicing Airbnb data engineering problems