Airbnb data engineering interview questions lean on SQL for marketplace-style analytics (aggregations, joins, window functions, and sessionization from event streams) and Python for data structures, intervals, and occasional graph or pandas-style coding. Deeper rounds often add verbal dimensional modeling and data warehouse design (star schema, SCDs, booking grain).
Everything below uses original teaching examples only—not proprietary interview items. The guide is structured for data engineering interview prep: worked examples in SQL and Python, a data modeling section, and topic-aligned practice via PipeCode’s Airbnb company hub. Narrate grain, join semantics, and NULL behavior as you go; interviewers care about the reasoning, not just the final answer.
Top Airbnb data engineering interview topics
| # | Topic | Why it shows up |
|---|---|---|
| 1 | Aggregation & GROUP BY / HAVING | Nightly price by market, booking counts, O–D (origin–destination) pairs, listing metrics. |
| 2 | Joins & anti-joins | Listings to bookings, page events to sessions, avoiding fan-out. |
| 3 | Window functions | "First" booking, distinct visitors per scope, LAG / LEAD, ranking. |
| 4 | Gaps, sessions & time | Session ids from inactivity (e.g. 30+ minutes), half-open time ranges. |
| 5 | Subqueries, CASE, market-style reports | Filtered rollups, market pulse, conditional metrics. |
| 6 | Python: arrays, intervals, graphs, pandas | Max revenue from bookings, multi-dimensional iterators, attribute inheritance, merges. |
| 7 | Star schema, rental facts & SCDs | STR grain, slowly changing listing or host attributes. |
SQL evaluation order (mental model): FROM / joins → WHERE (row filter) → GROUP BY → aggregates → HAVING (group filter) → windows → SELECT / ORDER BY. If you are filtering a sum or count, that condition almost always belongs in `HAVING`, not `WHERE`.
1. Aggregation and GROUP BY in marketplace SQL
Aggregation and GROUP BY in SQL for data engineering
Picture a table with many detail rows—often one row per booking or per night line item. Aggregation means: turn lots of rows into one summary value (or a few) per bucket. The bucket is whatever the prompt names: per market, per listing, per origin–destination pair, or per (listing, week). GROUP BY names that bucket: every distinct combination of the grouping keys is one group; the engine runs SUM, COUNT, AVG inside each group separately.
Report grain (say this in the interview): After the query, ask, “One output row is what, in business terms?” If the question asked “gross booking dollars per market,” the grain is market. If you GROUP BY at listing by mistake, you are answering a different question.
SUM, AVG, and NULL
- `SUM(col)` adds the non-null numerics in the group. `NULL` is skipped (not coerced to 0). If every value in the group is `NULL`, `SUM` is usually `NULL`, not 0.
- `AVG(col)` is the sum of non-null values ÷ the count of non-null values in that group.
Worked example: In one market_id group, amount_usd values 200, NULL, 300 → SUM(amount_usd) = 500, AVG(amount_usd) = 250 (only two values averaged).
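The worked example above can be checked directly with SQLite from Python; the mini table is hypothetical, built just to mirror the 200 / NULL / 300 group:

```python
import sqlite3

# Hypothetical mini table mirroring the worked example: 200, NULL, 300 in one market.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE night_stays (market_id TEXT, amount_usd REAL)")
con.executemany(
    "INSERT INTO night_stays VALUES (?, ?)",
    [("M1", 200), ("M1", None), ("M1", 300)],
)
total, avg = con.execute(
    "SELECT SUM(amount_usd), AVG(amount_usd) FROM night_stays GROUP BY market_id"
).fetchone()
print(total, avg)  # 500.0 250.0 (NULL skipped by both aggregates)
```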
COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)
- `COUNT(*)` counts rows in the group, including rows where some column is `NULL`.
- `COUNT(col)` counts rows where `col` is not `NULL`.
- `COUNT(DISTINCT col)` counts unique non-null values, which is required after a join that multiplies a guest or listing across many rows, when the question is "how many guests," not "how many booking rows."
Worked example: After a one-to-many join, two rows for the same guest_id → COUNT(*) = 2, COUNT(DISTINCT guest_id) = 1.
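A quick SQLite check of that example, with a hypothetical two-row post-join table:

```python
import sqlite3

# Hypothetical post-join rows: the same guest_id appears on two booking rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE joined (booking_id TEXT, guest_id TEXT)")
con.executemany("INSERT INTO joined VALUES (?, ?)", [("b1", "g1"), ("b2", "g1")])
rows, guests = con.execute(
    "SELECT COUNT(*), COUNT(DISTINCT guest_id) FROM joined"
).fetchone()
print(rows, guests)  # 2 1
```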
MIN / MAX
Use for prices, dates, ranks. NULL is skipped (same as in SUM semantics). In marketplace data, MAX(created_at) is often “latest booking in this group” before you switch to a window in §3.
WHERE vs HAVING
- `WHERE` filters input rows before `GROUP BY`: predicates on raw columns of the table as seen before grouping.
- `HAVING` filters after `GROUP BY` on aggregates, e.g. `HAVING SUM(amount) > 1000` or `HAVING COUNT(*) >= 2`.
Worked example — per-market subtotals, two different filters:
| booking_id | market_id | subtotal | is_cancelled |
|---|---|---|---|
| b1 | M1 | 80 | false |
| b2 | M1 | 30 | true |
| b3 | M2 | 200 | false |
- `WHERE NOT is_cancelled`, then `GROUP BY market_id, SUM(subtotal)` → M1 = 80, M2 = 200 (the cancelled row is dropped first).
- Same table without the `WHERE`, then `HAVING SUM(subtotal) > 50` (sum of all subtotals in the group) → M1 = 110, M2 = 200, so both markets pass, and M1's sum differs from the first bullet. Be explicit about which population the question means (all rows vs. completed only).
Rule of thumb: If the filter uses an aggregate of the group (SUM, COUNT, AVG), it belongs in HAVING, not WHERE. If it only uses one row at a time (e.g. status = 'confirmed'), use WHERE.
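Running both filters on the worked-example table makes the difference concrete; the table and values below are the ones from the example:

```python
import sqlite3

# The per-market subtotal table from the worked example above.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE bookings (booking_id TEXT, market_id TEXT, subtotal REAL, is_cancelled INTEGER)"
)
con.executemany(
    "INSERT INTO bookings VALUES (?, ?, ?, ?)",
    [("b1", "M1", 80, 0), ("b2", "M1", 30, 1), ("b3", "M2", 200, 0)],
)
# WHERE: drop cancelled rows before grouping.
where_first = dict(con.execute(
    "SELECT market_id, SUM(subtotal) FROM bookings WHERE NOT is_cancelled GROUP BY market_id"
))
# HAVING: group every row, then filter on the aggregate.
having_only = dict(con.execute(
    "SELECT market_id, SUM(subtotal) FROM bookings GROUP BY market_id HAVING SUM(subtotal) > 50"
))
print(where_first)  # {'M1': 80.0, 'M2': 200.0}
print(having_only)  # {'M1': 110.0, 'M2': 200.0}
```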
Half-open date ranges and revenue by period
Worked example: Nightly amount for calendar year 2024 per market_id:
```sql
SELECT market_id, SUM(amount_usd) AS rev
FROM night_stays
WHERE stay_date >= DATE '2024-01-01' AND stay_date < DATE '2025-01-01'
GROUP BY market_id;
```
Takeaway: The interval [2024-01-01, 2025-01-01) (half-open) avoids double counting a boundary if you also store timestamps at midnight. Say timezone if the table is TIMESTAMPTZ and the business rule is not UTC.
Conditional aggregation: SUM(CASE …) and net vs gross
Idea: Sum or count only the rows that meet a business rule, without a separate WHERE that would drop the rest of the group from the query entirely. Use this when you need both "gross" and "net" in one SELECT, or when partial rows must stay in the denominator of a rate.
Worked example — net booking dollars after cancellations (same shape as before, now with the full pattern):
| booking_id | market_id | amount_usd | is_cancelled |
|---|---|---|---|
| b1 | M1 | 200 | false |
| b2 | M1 | 100 | true |
SUM(CASE WHEN NOT is_cancelled THEN amount_usd ELSE 0 END) = 200; gross would be SUM(amount_usd) = 300 if the prompt asked for both columns.
Common beginner mistakes
- Putting a condition on `SUM(x)` in `WHERE` instead of `HAVING`.
- `COUNT(*)` when the business asked for unique guests or listings: use `COUNT(DISTINCT …)` at the right grain.
- Joining one-to-many before `GROUP BY` and summing a dimension field that repeats, a double count: fix at the right grain or dedupe first (see §2).
SQL interview question on net booking volume by market
Table: bookings(booking_id, market_id, amount_usd, is_cancelled). Return each market_id and net booking volume (sum of amount_usd where NOT is_cancelled) for markets where that net sum is strictly positive, ordered by net volume descending.
Solution using GROUP BY and conditional SUM
```sql
SELECT market_id,
       SUM(CASE WHEN NOT is_cancelled THEN amount_usd ELSE 0 END) AS net_vol
FROM bookings
GROUP BY market_id
HAVING SUM(CASE WHEN NOT is_cancelled THEN amount_usd ELSE 0 END) > 0
ORDER BY net_vol DESC;
```
Why this works: Cancellations contribute 0 in the CASE; HAVING filters per-market aggregates, not per-row. Grain: one row per market_id.
SQL interview question on distinct active guests by market (post-join)
Table: bookings(booking_id, market_id, guest_id, is_cancelled) (one row per booking). Return each market_id and COUNT(DISTINCT guest_id) among non-cancelled bookings only, ordered by that count descending.
Solution using COUNT(DISTINCT …) and WHERE
```sql
SELECT market_id, COUNT(DISTINCT guest_id) AS active_guests
FROM bookings
WHERE NOT is_cancelled
GROUP BY market_id
ORDER BY active_guests DESC;
```
Why this works: WHERE restricts to completed rows before grouping; COUNT(DISTINCT guest_id) counts people, not booking rows, at market grain. If a guest has two completed bookings in the same market, they still count once.
Practice
- SQL · Topic — Aggregation problems (all companies)
- COMPANY · Airbnb — Airbnb-tagged aggregation
When you want SQL by difficulty on the same hub, open Airbnb SQL (all difficulties) or Airbnb easy SQL to warm up.
2. Joins, anti-joins, and deduplication in SQL
Joins and deduplication in SQL for data engineering
A join answers: “For each row in A, which row(s) in B match the ON condition?” In marketplace analytics, you join facts (bookings, page views, impressions) to dimensions (listing, market, user profile) to get attributes for reporting. You are evaluated on: inner vs outer use, anti-join for “has no …”, and fan-out when a key is not unique on one side.
Fact vs dimension and grain (how interviewers want you to talk about joins):
- Fact: many rows, often one row per event or booking; the thing you `SUM` or `COUNT`.
- Dimension: one row per entity (in an ideal, clean model): listing, host, city, enriched via the join. If a dimension is not unique (duplicate `listing_id` rows), a join to facts can multiply rows; summing a dimension's attribute then inflates the metric.
Inner join
Result: Only rows that satisfy the ON on both sides. Unmatched keys disappear.
Worked example — bookings inner join listings on listing_id: if a booking’s listing_id was deleted in source and is missing in listings, that booking vanishes from the result. If the question is “revenue for known listings,” that may be right; if it is “all revenue, including orphan bookings,” you need a left or fuzzy key strategy—state your assumption out loud.
Left join and the anti-join pattern
LEFT JOIN returns all left-side rows, padding right columns with NULL when there is no match. Anti-join: LEFT JOIN … WHERE right.not_null_col IS NULL keeps only left rows with no partner in B. Classic: “Guests who have never completed a booking.”
Worked example (mini): guests ids G1, G2; bookings has G1 only. FROM guests g LEFT JOIN bookings b ON g.id = b.guest_id WHERE b.booking_id IS NULL returns G2 only.
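The mini example runs as written in SQLite; the two tiny tables are hypothetical:

```python
import sqlite3

# Mini anti-join from the worked example: G1 has a booking, G2 does not.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE guests (id TEXT)")
con.execute("CREATE TABLE bookings (booking_id TEXT, guest_id TEXT)")
con.executemany("INSERT INTO guests VALUES (?)", [("G1",), ("G2",)])
con.execute("INSERT INTO bookings VALUES ('b1', 'G1')")
no_booking = [r[0] for r in con.execute(
    """
    SELECT g.id
    FROM guests g
    LEFT JOIN bookings b ON g.id = b.guest_id
    WHERE b.booking_id IS NULL
    """
)]
print(no_booking)  # ['G2']
```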
Semi-join: EXISTS / IN (keep left rows, do not add B’s columns)
“Listings that had at least one booking in NYC” is often a filter on listings, not a full merge. WHERE EXISTS (SELECT 1 FROM bookings … WHERE … listing_id = listings.id) keeps one row per listing; compare to joining all bookings, then DISTINCT, which is heavier and easier to get wrong on grain.
Fan-out: when one listing matches many booking rows
Bug pattern: SUM(dimension.accommodates) after bookings ⋈ listings with one row per booking will add the same accommodates once per booking, which is not "total capacity in the market" but "sum of capacity over booking rows." Fixes: SUM at listing grain in a subquery, MAX/MIN of the attribute per listing, or a COUNT(DISTINCT listing_id)-style question instead of a blind sum.
Worked example:
Dimension table:

| guest_id (dim) | credit_score |
|---|---|
| 1 | 720 |

Fact table:

| booking_id | guest_id |
|---|---|
| b1 | 1 |
| b2 | 1 |

The JOIN gives two rows; SUM(credit_score) = 1440 if you are not careful, which is wrong for "system-wide average credit of guests with bookings." Use AVG per guest in a CTE, or MAX(credit_score) if one value per guest_id is intended in the result.
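The inflation and one of the listed fixes (collapse to one row per guest before aggregating) can be shown side by side; tables are the hypothetical ones above:

```python
import sqlite3

# Fan-out demo: one dimension row joined to two fact rows doubles the attribute sum.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE guest_dim (guest_id INTEGER, credit_score INTEGER)")
con.execute("CREATE TABLE bookings (booking_id TEXT, guest_id INTEGER)")
con.execute("INSERT INTO guest_dim VALUES (1, 720)")
con.executemany("INSERT INTO bookings VALUES (?, ?)", [("b1", 1), ("b2", 1)])
(naive,) = con.execute(
    "SELECT SUM(d.credit_score) FROM bookings b JOIN guest_dim d ON b.guest_id = d.guest_id"
).fetchone()
# Fix: collapse to one row per guest before aggregating.
(fixed,) = con.execute(
    """
    SELECT SUM(credit_score) FROM (
      SELECT d.guest_id, MAX(d.credit_score) AS credit_score
      FROM bookings b JOIN guest_dim d ON b.guest_id = d.guest_id
      GROUP BY d.guest_id
    )
    """
).fetchone()
print(naive, fixed)  # 1440 720
```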
Deduplication: ROW_NUMBER vs DISTINCT ON
“Latest base_price per listing” in a type-2-ish history table: prefer ROW_NUMBER() OVER (PARTITION BY listing_id ORDER BY as_of_date DESC) AS rn then WHERE rn = 1 in an outer filter so you control ties. DISTINCT ON (listing_id) … ORDER BY … (PostgreSQL) is compact; tie semantics must match the business rule (e.g. break ties on ingest_id).
Worked example (pattern only):
```sql
SELECT listing_id, base_price, as_of_date
FROM (
  SELECT listing_id, base_price, as_of_date,
         ROW_NUMBER() OVER (
           PARTITION BY listing_id ORDER BY as_of_date DESC, ingest_id DESC
         ) AS rn
  FROM listing_price_history
) t
WHERE rn = 1;
```
Common beginner mistakes
- Joining on `NULL = NULL`: not true; the rows do not match. Use `COALESCE` or explicit `IS NULL` logic only if the spec allows it.
- Inner join when the question needs the unmatched side kept (left / outer join).
- Aggregating at a join-inflated grain without a pre-aggregate or distinct step.
Joins-based interview question on O–D pair volume
Tables: bookings(booking_id, origin_city, dest_city, amount_usd, is_cancelled). Return each ordered pair (origin_city, dest_city) with count of non-cancelled bookings, excluding pairs where count < 2. Order by count descending.
Solution using GROUP BY and HAVING
```sql
SELECT origin_city, dest_city, COUNT(*) AS n
FROM bookings
WHERE NOT is_cancelled
GROUP BY origin_city, dest_city
HAVING COUNT(*) >= 2
ORDER BY n DESC;
```
Why this works: Grain = (origin, dest); WHERE filters rows; HAVING enforces the "at least two completed bookings" rule on the group.
Joins-based interview question on guests with no completed booking
Tables: guests(guest_id, email), bookings(booking_id, guest_id, status) with status in ('pending','confirmed','cancelled'). Return guest_id and email for guests with no row where status = 'confirmed'.
Solution using LEFT JOIN and NOT EXISTS (two valid patterns)
Option A — anti-join:
```sql
SELECT g.guest_id, g.email
FROM guests g
LEFT JOIN bookings b
  ON b.guest_id = g.guest_id AND b.status = 'confirmed'
WHERE b.booking_id IS NULL;
```
Option B — NOT EXISTS:
```sql
SELECT g.guest_id, g.email
FROM guests g
WHERE NOT EXISTS (
  SELECT 1
  FROM bookings b
  WHERE b.guest_id = g.guest_id AND b.status = 'confirmed'
);
```
Why this works: Option A: an unmatched LEFT JOIN row pads booking_id with NULL, and we keep those rows. Option B: set membership, often clearer when bookings needs multiple conditions. Both are standard anti-join patterns interviewers like you to name.
Practice
- SQL · Topic — Joins & deduplication
- COMPANY · Airbnb — Airbnb-tagged joins
Airbnb-tagged join drills (easy) are a good next step.
3. Window functions and ranking in SQL
Window functions and ranking in SQL for travel and marketplace data
GROUP BY collapses many rows to one per group. Window functions keep every input row and attach a value computed over a window of rows—usually “all rows in the same partition that sort before or with this row.” In travel and listing data, that is how you express first booking to a city, top-N hosts by neighborhood, running revenue, or gaps between events.
Parts of a window: PARTITION BY (optional) splits the table into independent groups—e.g. per guest_id or per (market_id, listing_id). ORDER BY inside OVER (…) defines order within that partition. ROWS BETWEEN … (optional) bounds the frame for running sums (not always needed in interviews if the default frame is correct).
ROW_NUMBER, RANK, and DENSE_RANK
All need ORDER BY in OVER (for meaningful output).
- `ROW_NUMBER()`: unique integers 1..n in the partition after the sort; no ties. If two rows tie on the sort key, the engine picks an arbitrary order unless you add tie-break columns (e.g. `guest_id ASC`).
- `RANK()`: tied rows get the same rank; the next rank skips (1, 2, 2, 4).
- `DENSE_RANK()`: tied rows get the same rank; no gaps (1, 2, 2, 3).
Worked example — three trips in one market_id by amt_usd desc:
| guest_id | amt_usd |
|---|---|
| 10 | 500 |
| 20 | 400 |
| 30 | 400 |
ROW_NUMBER might be 1, 2, 3 (with tie-break on guest_id). RANK is 1, 2, 2; DENSE_RANK is 1, 2, 2. Use ROW_NUMBER when the problem says “pick one winner per market” with a clear tie-break; use DENSE_RANK when you want “top 3 price tiers” (ties share a tier).
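The three-trip example evaluates as described in SQLite (window functions need SQLite 3.25+, bundled with modern Python builds):

```python
import sqlite3

# Ranking demo on the three-trip worked example above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE trips (guest_id INTEGER, amt_usd INTEGER)")
con.executemany("INSERT INTO trips VALUES (?, ?)", [(10, 500), (20, 400), (30, 400)])
rows = con.execute(
    """
    SELECT guest_id,
           ROW_NUMBER() OVER (ORDER BY amt_usd DESC, guest_id ASC) AS row_num,
           RANK()       OVER (ORDER BY amt_usd DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY amt_usd DESC) AS drnk
    FROM trips
    ORDER BY amt_usd DESC, guest_id ASC
    """
).fetchall()
print(rows)  # [(10, 1, 1, 1), (20, 2, 2, 2), (30, 3, 2, 2)]
```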
LAG and LEAD for time gaps and prior values
- `LAG(col, 1)` = the previous row's `col` in partition order; the first row in a partition gets `NULL`.
- `LEAD(col, 1)` = the next row's `col` (useful for "time to next booking").
Worked example (dates): LAG(order_date) per guest_id in order → gap in days = order_date - LAG (for DATE type in PostgreSQL-style SQL, difference is an integer day count; for timestamps, use EPOCH difference / 60 for minutes in session rules).
Running totals (optional pattern)
Cumulative SUM(amount) OVER (PARTITION BY market_id ORDER BY booking_ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) = the running total per market in time order; the same idea as "year-to-date" metrics with a clear ORDER BY.
Common beginner mistakes
- Forgetting `ORDER BY` in `OVER` for `ROW_NUMBER` / `RANK` / `LAG`, creating nondeterministic or meaningless output.
- Using `RANK` = 1 when two rows tie and the prompt still wants a unique "winner": add a tie-break to `ORDER BY` and use `ROW_NUMBER`, or get clarification.
SQL interview question on top spender per market
Table: spend(guest_id, market_id, amt_usd), assumed one row per guest-market (if multiple rows per guest-market are possible, clarify with the interviewer). Return the guest_id with the highest amt_usd in each market_id (ties on amount: pick the lowest guest_id).
Solution using ROW_NUMBER
```sql
SELECT guest_id, market_id, amt_usd
FROM (
  SELECT guest_id, market_id, amt_usd,
         ROW_NUMBER() OVER (
           PARTITION BY market_id ORDER BY amt_usd DESC, guest_id ASC
         ) AS rn
  FROM spend
) t
WHERE rn = 1;
```
Why this works: rn = 1 is exactly one row per market_id; guest_id ASC breaks ties on amt_usd as specified.
SQL interview question on guests whose second booking is to a given city (pattern with LAG)
Table: bookings(guest_id, booking_seq, dest_city, booking_ts) with booking_seq 1, 2, 3, … in true chronological order per guest (simplified). Return guest_id for guests whose second booking has dest_city = 'NYC'.
Solution using LAG to compare sequence
If booking_seq is 1-based and consecutive, a simpler form is: WHERE booking_seq = 2 AND dest_city = 'NYC'. The window version (when you only have booking_ts, not a sequence number):
```sql
SELECT guest_id
FROM (
  SELECT guest_id, dest_city,
         ROW_NUMBER() OVER (PARTITION BY guest_id ORDER BY booking_ts) AS rn
  FROM bookings
) t
WHERE rn = 2 AND dest_city = 'NYC';
```
Why this works: ROW_NUMBER over ORDER BY booking_ts labels the 1st, 2nd, 3rd trip per guest; we filter the 2nd row. If the question instead needs LAG, LAG(dest_city) OVER (PARTITION BY guest_id ORDER BY booking_ts) on the third event row gives the second destination only if the grain is one row per leg; match the ERD to the business rule.
Practice
- SQL · Topic — Window functions (SQL)
- COMPANY · Airbnb — Airbnb-tagged window functions
4. Sessionization and time gaps in event SQL
Event gaps, sessions, and half-open time ranges in SQL
Sessionization questions ask you to group events into sessions when silence exceeds a threshold—classic for web, app, and search logs. A common rule: start a new session if the time since the previous event in the same partition (e.g. user_id) is strictly more than 30 minutes (confirm N and strict vs ≥ in the spec).
Three-step set-based pattern:
1. `LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts)` to align each event with its previous one.
2. `is_start`: 1 if there is no previous event, else 1 if the gap (in seconds or minutes) exceeds the threshold, else 0.
3. `session_id`: `SUM(is_start) OVER (PARTITION BY user_id ORDER BY event_ts)`, a running count of session starts that labels each row with a monotonic session id within the user (1, 1, 1, 2, 2, … in time order).
Half-open reporting windows on dates/timestamps: use [start, end) (see §1) so you do not double count events at a midnight boundary in mixed DATE/TIMESTAMP stores.
Worked example — gap flags and cumulative id:
| user_id | ts | (gap from prev) | is_start | session_id (cumulative) |
|---|---|---|---|---|
| u1 | 10:00 | — | 1 | 1 |
| u1 | 10:10 | 10m | 0 | 1 |
| u1 | 11:00 | 50m | 1 | 2 |
(With 30-minute inactivity to start a new session.)
LAG and a gap predicate
After prev_ts is in the row, “is new session?” is prev_ts IS NULL OR (event_ts - prev_ts) > interval '30 minutes' (dialect-specific).
Full session_id sketch (PostgreSQL-style; illustrative)
Worked example:
```sql
WITH t AS (
  SELECT user_id, event_id, event_ts,
         LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts) AS prev_ts
  FROM events
),
f AS (
  SELECT *,
         CASE
           WHEN prev_ts IS NULL THEN 1
           WHEN event_ts - prev_ts > INTERVAL '30 minutes' THEN 1
           ELSE 0
         END AS is_start
  FROM t
)
SELECT user_id, event_id, event_ts,
       SUM(is_start) OVER (PARTITION BY user_id ORDER BY event_ts) AS session_id
FROM f;
```
Why this works: is_start fires at each new session boundary; the prefix sum in time order is the session counter for that user_id.
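The same three-step logic can be rendered in pure Python, a sketch assuming events arrive as (user_id, datetime) tuples; the `sessionize` helper name and shapes are made up for illustration:

```python
from datetime import datetime, timedelta

def sessionize(events, gap=timedelta(minutes=30)):
    """Label (user_id, ts) events with per-user session ids.

    A pure-Python version of the LAG + is_start + prefix-sum pattern.
    """
    out, prev, sid = [], {}, {}
    for user, ts in sorted(events, key=lambda e: (e[0], e[1])):
        # New session if there is no previous event or silence exceeds the gap.
        if user not in prev or ts - prev[user] > gap:
            sid[user] = sid.get(user, 0) + 1
        prev[user] = ts
        out.append((user, ts, sid[user]))
    return out

events = [
    ("u1", datetime(2024, 6, 1, 10, 0)),
    ("u1", datetime(2024, 6, 1, 10, 10)),
    ("u1", datetime(2024, 6, 1, 11, 0)),
]
print([s for _, _, s in sessionize(events)])  # [1, 1, 2]
```

This reproduces the gap-flag table above: the 10-minute gap stays in session 1, the 50-minute gap opens session 2.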
Common beginner mistakes
- Time zone drift between ingest and display—fix a convention in the answer.
- Wrong `PARTITION BY` (e.g. a session spanning users).
- Reusing a global `session_id` without resetting per day when the spec says "sessions that day": filter events to the day before you count distinct session labels in that range.
SQL interview question on orders with gap > 7 days since previous (pattern)
Table: orders(order_id, guest_id, order_date) (DATE). Return order_id where the day gap from this guest’s previous order_date (by order_date ascending) is > 7; drop a guest’s first order (no predecessor).
Solution using LAG
```sql
SELECT order_id, guest_id, order_date, prev_d,
       order_date - prev_d AS gap_days
FROM (
  SELECT order_id, guest_id, order_date,
         LAG(order_date) OVER (
           PARTITION BY guest_id ORDER BY order_date, order_id
         ) AS prev_d
  FROM orders
) t
WHERE prev_d IS NOT NULL
  AND order_date - prev_d > 7;
```
Why this works: LAG + prev_d IS NOT NULL (first order removed); for app-event gaps in minutes, switch to TIMESTAMP and EPOCH difference.
SQL interview question: sessions per user on one calendar day (outline + solution)
Table: events(user_id, event_ts); 30-minute rule. Count of distinct sessions per user_id for event_ts on 2024-06-01 (UTC; state the assumption). (Use the CTE pattern above, then WHERE event_ts >= … AND event_ts < next day, GROUP BY user_id, COUNT(DISTINCT session_id) or equivalent.)
Solution (pipeline outline)
1. Compute `session_id` for all `events` (full sketch above).
2. `WHERE event_ts >= TIMESTAMP '2024-06-01 00:00:00' AND event_ts < TIMESTAMP '2024-06-02 00:00:00'`.
3. `GROUP BY user_id` with `COUNT(DISTINCT session_id)`: since `session_id` is computed per user (the `PARTITION BY user_id`), distinct ids over the filtered rows count the sessions touching that day. `MAX(session_id) - MIN(session_id) + 1` is not always valid because it assumes consecutive ids inside the window. (Interview credit is for the `LAG` + `is_start` + prefix-sum pattern, not a perfect distinct hack.)
Why this works: The count of logical session segments in the time window is what you need; match your id construction to a per-user or per-day reset if the spec requires it.
Practice
- SQL · Topic — Time-series SQL
- COMPANY · Airbnb — all Airbnb problems (hub)
Medium-difficulty SQL on Airbnb often mixes windows with time logic.
5. Subqueries, CASE, and market-style rollups in SQL
CASE, filtered aggregates, and subqueries in SQL
Market or “pulse” reports usually slice a fact table (bookings, listings) by market_id, category, or date, and ask for multiple metrics in one query—conditional counts/sums are the workhorse. You express “only confirmed” or “nights at least 7” either with WHERE (if the whole row is in/out of the population) or with SUM(CASE …) / COUNT(*) FILTER (WHERE …) when the same SELECT must output both all-row counts and restricted metrics.
Searched CASE in aggregates
Idea: Build a 0/1 or NULL flag inside the aggregate. COUNT(*) still counts rows; SUM(CASE WHEN condition THEN 1 END) counts rows where the condition is true and SUM skips NULL.
Worked example (logical): In one GROUP BY market_id, you can output: total rows, SUM(CASE WHEN status = 'confirmed' THEN 1 END), and SUM(CASE WHEN status = 'confirmed' AND nights >= 7 THEN 1 END) in one pass.
FILTER (PostgreSQL) vs CASE
COUNT(*) FILTER (WHERE status = 'confirmed') is often clearer than nested SUM(CASE…); both compile to the same class of plan in many optimizers. If the engine has no FILTER, translate mechanically to SUM(CASE WHEN … THEN 1 ELSE 0 END).
Ratios, percentages, and NULLIF for safe division
Worked example (pattern):
SUM(CASE WHEN is_click THEN 1.0 ELSE 0 END) / NULLIF(SUM(CASE WHEN is_impression THEN 1.0 ELSE 0 END), 0) gives CTR at (campaign, user) grain. Dedupe impressions/clicks to one row per pair first if the spec says so, or the denominator will not match the business definition (see §2 on fan-out).
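A minimal SQLite check of the safe-division pattern, on a hypothetical four-row event table (one of the rows is neither click nor impression):

```python
import sqlite3

# NULLIF guards the denominator: zero impressions yields NULL (None), not an error.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (is_click INTEGER, is_impression INTEGER)")
con.executemany("INSERT INTO events VALUES (?, ?)", [(0, 1), (1, 1), (0, 1), (0, 0)])
(ctr,) = con.execute(
    """
    SELECT SUM(CASE WHEN is_click THEN 1.0 ELSE 0 END)
           / NULLIF(SUM(CASE WHEN is_impression THEN 1.0 ELSE 0 END), 0)
    FROM events
    """
).fetchone()
print(ctr)  # 1 click / 3 impressions = 0.333...
```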
CTEs to fix grain before a join
Idea: “Markets with > 5 active listings” may need FROM (SELECT market_id, COUNT(*) AS n FROM listings WHERE is_active GROUP BY market_id) t WHERE n > 5, or a CTE doing the same. CTEs are for readability; the key is to aggregate in an inner query and filter in an outer query when the predicate is on the aggregate, not on a row flag.
Common beginner mistakes
- Integer division in `a/b` when both are integers: use `* 1.0`, `CAST(… AS FLOAT)`, or `NUMERIC(12,4)`.
- CTR at impression grain vs (user, campaign) grain: match the dedupe rule to the denominator definition.
- `WHERE` on `SUM(…)` at the same level as the raw `FROM` is illegal in standard SQL; use `HAVING` or a subquery/CTE.
SQL interview question on tiered market counts
Table: bookings(listing_id, market_id, nights, status); status in ('pending','confirmed','cancelled'). Return each market_id, the count of confirmed bookings, and the count of confirmed bookings with nights >= 7.
Solution using conditional counts
```sql
SELECT market_id,
       COUNT(*) FILTER (WHERE status = 'confirmed') AS n_conf,
       COUNT(*) FILTER (WHERE status = 'confirmed' AND nights >= 7) AS n_long_stay
FROM bookings
GROUP BY market_id;
```
Why this works: One pass per market_id; FILTER is the readable form of two independent filters on the confirmed population for the long-stay column.
SQL interview question: markets with high share of long-stay (rate + HAVING)
Table: same as above. Return market_id where among confirmed bookings, at least 20% have nights >= 7, and there are at least 10 confirmed bookings in that market.
Solution using HAVING on a ratio
```sql
SELECT market_id
FROM bookings
WHERE status = 'confirmed'
GROUP BY market_id
HAVING COUNT(*) >= 10
   AND SUM(CASE WHEN nights >= 7 THEN 1 ELSE 0 END) * 1.0 / NULLIF(COUNT(*), 0) >= 0.2;
```
Why this works: Confirmed-only base comes from WHERE status = 'confirmed'. The numerator is long-stay count; denominator is all confirmed rows in the group; HAVING enforces both min volume and min share. Use 1.0 to avoid integer division in engines that are picky.
Practice
- SQL · Topic — Conditional logic
- SQL · Topic — Subqueries
6. Python: arrays, intervals, graphs, and pandas for DE screens
Data structures, sweep lines, and tabular code in Python for data engineering
Python screens for data engineering (including marketplace-style Airbnb sets) are usually O(n log n) or O(n): you hash keys, sort for intervals, or BFS/DFS on a small graph. You are not expected to re-prove theorems, but to name complexity, avoid nested scans over the whole list per element, and match the grain of pandas merges to join rules in SQL.
set, dict, and defaultdict for grouping and membership
- `set`: O(1) average `in` and `add`; first duplicate, "seen," uniques.
- `dict` / `defaultdict`: aggregate in one pass by key, the same idea as `GROUP BY` in SQL. `defaultdict(int)` avoids `KeyError` on `+=`.
Worked example — frequency: ['a','b','a'] → Counter or dict gives a:2, b:1.
Two pointers (sorted arrays) and one-pass first duplicate
Two pointers often appear in merged or intersection problems after sorting. First duplicate in read order is not two-pointer; it is one left-to-right with a set.
Worked example — merge two sorted unique lists (sketch): i, j from 0, advance the smaller, append to out—O(n+m) time, O(1) extra if output can be in place.
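The merge sketch above, written out; `merge_sorted` is an illustrative helper name:

```python
def merge_sorted(a, b):
    """Merge two sorted lists with two pointers in O(n + m)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        # Advance whichever pointer holds the smaller value.
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # one side may have leftovers
    out.extend(b[j:])
    return out

print(merge_sorted([1, 3, 5], [2, 3, 6]))  # [1, 2, 3, 3, 5, 6]
```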
Intervals, sorting by end, and “non-overlapping” / sweep ideas
Many “booking” or reservation problems are intervals [start, end). Sorted by end time, a greedy that takes an interval if its start is after the last end is the classic “max number of non-overlapping” pattern (or min rooms with a min-heap of end times for concurrent count—know which variant the prompt is asking for).
Worked example (intervals, logical): [(1,3),(2,4),(3,5)] — if inclusive overlap at a point, 3 are concurrent at t=3; a sweep counts +1 on start events, -1 on end; max of prefix is concurrent count.
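A sweep-line sketch of the "concurrent count" reading of that example, treating endpoints as inclusive so the meeting point at t=3 counts as overlap:

```python
def max_concurrent(intervals):
    """Sweep line: +1 at each start, -1 at each end; max prefix = peak concurrency."""
    events = []
    for s, e in intervals:
        events.append((s, 1))
        events.append((e, -1))
    # Starts before ends at the same time -> inclusive endpoints overlap.
    events.sort(key=lambda ev: (ev[0], -ev[1]))
    cur = best = 0
    for _, delta in events:
        cur += delta
        best = max(best, cur)
    return best

print(max_concurrent([(1, 3), (2, 4), (3, 5)]))  # 3
```

Flip the sort (ends before starts) to get half-open [start, end) semantics instead.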
Trees and graphs (inheritance, category trees)
Parent pointer to root or adj list for BFS/DFS from a source; O(V+E). Inheritance of listing attributes in a tree is often: walk up to root until a set value is found (with path or cache in harder variants).
Worked example (verbal): parent[child] = p in a dict; walk from leaf to root in O(depth).
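The walk-up inheritance idea can be sketched as follows; `parent` maps child to parent, `values` maps node to a set attribute, and both names are illustrative:

```python
def inherited_value(node, parent, values):
    """Walk parent pointers toward the root until a set value is found.

    Returns None if nothing on the path has the attribute; O(depth) per lookup.
    """
    cur = node
    while cur is not None:
        if cur in values:
            return cur and values[cur]
        cur = parent.get(cur)
    return None

parent = {"room": "apartment", "apartment": "building", "building": None}
values = {"building": "NYC"}  # only the root has a city set
print(inherited_value("room", parent, values))  # 'NYC'
```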
pandas: merge and groupby (and fan-out, again)
pd.merge(orders, dim, on='listing_id', how='left') then g = merged.groupby('market_id')['amount'].sum() is “SQL join + GROUP BY” in memory. If dim has duplicate listing_id rows, sum of a dimension column can double-count (same fan-out story as §2).
Worked example (conceptual pandas): After a one-to-one listing dimension, merge(...).groupby('market_id', as_index=False)['gross_nights'].sum() matches per-market grain.
Common beginner mistakes
- O(n²) from a `for` loop running `x in list` membership tests when the list is large: precompute a `set` of allowed ids.
- Off-by-one on inclusive vs half-open interval ends in overlap checks.
- `pandas` merge on non-unique right keys, then summing a duplicated column from the right.
Python interview question on first duplicate in order ids
Input: ids: list[int]. Return the first value that appears twice in left-to-right order, or None if there is no duplicate.
Solution using a set
```python
def first_dup(ids: list[int]) -> int | None:
    seen: set[int] = set()
    for x in ids:
        if x in seen:
            return x
        seen.add(x)
    return None
```
Why this works: O(n) time, O(n) space; the first time you see a value for the second time, return it.
Python interview question on merge quantities by key
Input: two lists of dicts a and b, each with keys sku, qty. Return one dict sku -> total qty.
Solution using defaultdict
```python
from collections import defaultdict

def merge(a: list[dict], b: list[dict]) -> dict[str, int]:
    t: dict[str, int] = defaultdict(int)
    for row in a + b:
        t[row["sku"]] += row["qty"]
    return dict(t)
```
Why this works: One pass; same pattern as GROUP BY sku, SUM(qty).
Python interview question on maximum number of non-overlapping stay intervals (greedy by end time)
Input: intervals: list[tuple[int, int]] as (start, end) on a discrete day line; treat non-overlapping as end < next start (equivalently half-open [start, end) in continuous time). Return the maximum number of pairwise non-overlapping intervals (classic greedy: sort by end, then take the next interval whose start is strictly after the last chosen end).
Solution: sort by end, greedy count
```python
def max_non_overlapping(intervals: list[tuple[int, int]]) -> int:
    if not intervals:
        return 0
    intervals = sorted(intervals, key=lambda x: x[1])
    last_end = float("-inf")
    n = 0
    for s, e in intervals:
        if s > last_end:
            n += 1
            last_end = e
    return n
```
Why this works: Earliest-end-first greedy is optimal for max count of non-overlapping intervals on a line when no overlap means start > previous end (adjust to >= if the spec shares boundary days). Confirm inclusive vs exclusive end with the interviewer.
Practice
- PYTHON · Topic — Array (Python)
- COMPANY · Airbnb — Airbnb-tagged array
For graph-indexed practice on the same hub, use Airbnb-tagged graph. The full language filter is Airbnb Python; Python medium on Airbnb matches many longer coding prompts.
General (non–company) drills: Sweep line and graph topic pages—use if you are weak on the shapes that show up in harder prompts.
7. Star schema, rental facts, and SCDs in data engineering
Dimensional modeling and SCDs for warehouse-style questions
Star schema is the default whiteboard structure: a fact table of measures at a clear grain + foreign keys to dimension tables that hold descriptive attributes. For short-term rental (STR) analytics, your fact might be booking-level, night-level, or daily listing-level—state the grain first in any design answer.
Star vs snowflake (one sentence each)
- Star: dimensions are denormalized into wide tables (e.g. city name on `dim_listing`); fewer joins, more redundancy.
- Snowflake: dimensions are normalized (e.g. city in `dim_geo`); more joins, less duplicated text. Tradeoff: query simplicity vs storage and consistency of lookup tables.
Fact grain for STR: booking vs nightly
Worked example (verbal): One row per booking → natural for gross booking value, cancel flags, host payout in one row. One row per (listing, night) in an occupancy calendar → natural for nightly rate × nights, blocked calendar, and revenue when each night is priced. Do not put night-level and booking-level facts in the same table without a bridge or a derived ETL—interviewers will press you on that.
Surrogate keys (listing_sk, guest_sk): The warehouse uses integers that do not change when a source system merges accounts; facts store listing_sk, not raw PMS ids when dimensions are conformed.
SCD Type 1 vs Type 2 for listing attributes
- Type 1 — overwrite: one row per natural listing; fix typos or only care about current city—no history.
- Type 2 — new row for change: a new listing_sk (or a new row in dim_listing with valid_from / valid_to / is_current) when bedrooms or city change; as-of reporting joins the version that was valid on the event date.
Worked example (table) — one natural listing, two SCD2 rows:
| listing_sk | natural_id | city | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| 1001 | L7 | NYC | 2020-01-01 | 2024-12-31 | false |
| 1002 | L7 | MIA | 2025-01-01 | NULL | true |
A 2024 booking should join listing_sk = 1001; a current-listings report uses listing_sk = 1002.
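The as-of lookup against the two SCD2 rows above can be sketched in Python (a real warehouse would do this as a range join in SQL; the in-memory list and linear scan here are simplifications for illustration):

```python
from datetime import date
from typing import Dict, List, Optional

# Mirrors the SCD2 table above; valid_to None plays the role of NULL.
DIM_LISTING = [
    {"listing_sk": 1001, "natural_id": "L7", "city": "NYC",
     "valid_from": date(2020, 1, 1), "valid_to": date(2024, 12, 31)},
    {"listing_sk": 1002, "natural_id": "L7", "city": "MIA",
     "valid_from": date(2025, 1, 1), "valid_to": None},
]

def as_of_sk(natural_id: str, event_date: date,
             dim: List[Dict] = DIM_LISTING) -> Optional[int]:
    """Return the surrogate key whose validity window covers event_date
    (valid_from <= event_date <= valid_to, open-ended when valid_to is None)."""
    for row in dim:
        if (row["natural_id"] == natural_id
                and row["valid_from"] <= event_date
                and (row["valid_to"] is None or event_date <= row["valid_to"])):
            return row["listing_sk"]
    return None

print(as_of_sk("L7", date(2024, 7, 15)))  # 1001 (the NYC version)
print(as_of_sk("L7", date(2025, 3, 1)))   # 1002 (current MIA row)
```

The SQL equivalent is a join on `valid_from <= fact_date AND (valid_to IS NULL OR fact_date <= valid_to)`, which is exactly why a 2024 booking lands on listing_sk 1001.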
Bridge and degenerate dimensions (naming only)
- Bridge if you have many-to-many (e.g. amenity flags → bridge_listing_amenity).
- Degenerate dims in the fact: booking_id on a line-like fact is often stored in the fact itself, not in a separate 1:1 dimension table.
Common beginner mistakes
- Mixing two grains in one fact table without documenting it or deriving both from one canonical fact with clear lineage.
- Using Type 1 for geography when 2023 revenue by the old city is still required.
- Forgetting late-arriving bookings when discussing idempotent loads (verbally).
Data engineering interview question on “current” vs historical listing
Explain how to store listing attributes (e.g. city, accommodates) so a 2024 revenue report for “NYC” is stable when a listing moves to another city in 2025.
Solution using SCD2 on listing dimension (verbal + sketch)
Answer outline: SCD2 rows on dim_listing with valid_from, valid_to (or is_current); fact rows for nightly or booking revenue reference listing_sk active at fact date. NYC 2024 joins to the version valid for each date, not the 2025 row.
Why this works: As-of reporting stays correct; for “current” listing rows, analysts filter on is_current, while historical facts keep their old listing_sk keys.
Data engineering interview question: one fact for nightly revenue vs one for bookings (verbal)
Prompt: We need both (a) revenue by booking for cancellations and (b) revenue by night for calendar analytics. One table or two?
Solution (verbal + sketch)
- Prefer at least two facts: (a) fact_booking, one row per booking, for gross booking value, status, and payout line items; (b) fact_night, one row per (booking, night) or (listing, night), for nightly rate, promo allocation, and night-level experiments. Conformed dim_listing and date dimensions keep rollups consistent; a data mart can pre-join for analysts if needed.
Why this works: Different grains → different facts (or a nightly bridge with very explicit rules). Stating the grain in one clear sentence is often the scoring move.
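A two-fact design like this usually carries a reconciliation check in the load: nightly rows must roll up to the booking-grain total. A minimal Python sketch, with illustrative row shapes and column names (not a real schema):

```python
from collections import defaultdict
from typing import Dict, List

def rollup_nightly(fact_night: List[Dict]) -> Dict[str, float]:
    """Sum nightly revenue per booking_id (the booking grain)."""
    totals: Dict[str, float] = defaultdict(float)
    for row in fact_night:
        totals[row["booking_id"]] += row["nightly_revenue"]
    return dict(totals)

fact_booking = [{"booking_id": "B1", "gross_booking_value": 300.0}]
fact_night = [
    {"booking_id": "B1", "night": "2024-06-01", "nightly_revenue": 100.0},
    {"booking_id": "B1", "night": "2024-06-02", "nightly_revenue": 100.0},
    {"booking_id": "B1", "night": "2024-06-03", "nightly_revenue": 100.0},
]

totals = rollup_nightly(fact_night)
for b in fact_booking:
    # Tolerance matters when nightly allocation involves division.
    assert abs(totals[b["booking_id"]] - b["gross_booking_value"]) < 1e-9
print("grains reconcile")
```

Saying out loud that you would wire this kind of check into the ETL (rather than trusting the two facts to agree) is an easy credibility point in a design round.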
Practice
- SQL · Topic — Dimensional modeling
- SQL · Topic — ETL (topic)
Use the Airbnb hub to pair modeling study with hard-difficulty problems that mention data warehouse or dimensional modeling. PipeCode also offers interview-first courses; see Data modeling for data engineering interviews and ETL system design for data engineering interviews.
Tips to crack Airbnb data engineering interviews
Data engineering interview preparation for a marketplace and travel company like Airbnb means being fluent in SQL grain and windows, in time and session questions, in practical Python (not only CRUD), and in a credible warehouse story on dimensions and SCDs.
- Re-read the 20-problem mix on the Airbnb company page: it skews medium SQL, with graph and data modeling at the hard end of the set—plan reps accordingly.
- For SQL, always say one sentence of grain before you write FROM. Name NULL and join-cardinality assumptions.
- For sessionization and windows, practice LAG/LEAD and ROW_NUMBER on paper until the partition and order are automatic.
- For Python, get comfortable with dict/defaultdict, heap or sort-plus-sweep for intervals, and read-only graph walks; add pandas merges if your loop historically uses them.
- For modeling, memorize one tight example of SCD2 with as-of joins; link it to nightly vs booking facts for STR (short-term rental) metrics.
| Lane | Airbnb-focused path |
|---|---|
| Company hub | /explore/practice/company/airbnb |
| SQL (all) | /explore/practice/company/airbnb/sql |
| Python (all) | /explore/practice/company/airbnb/python |
| By difficulty (easy) | /explore/practice/company/airbnb/difficulty/easy |
Across the platform, 450+ problems with tests and AI feedback are available; use topic and company filters to stay close to the skills above.
Frequently asked questions
What is the Airbnb data engineering interview process?
Rounds often include a recruiter or hiring manager screen, one or more technical sessions (SQL and Python are common), data-centric system or pipeline design, and behavioral interviews. Exact steps vary by level and org—treat the curated Airbnb problem list as pattern prep, not a guarantee of on-site order.
What topics should I study for an Airbnb data engineering interview?
SQL: joins, aggregations, window functions, subqueries, deduplication, and time and session logic. Python: arrays and hashes, intervals (including sweep line), graphs for hierarchical data, and sometimes pandas. Modeling: star schema, facts at clear grain, and SCDs for listing-style change. This guide maps each area to practice links—see the sections and table above.
How difficult are Airbnb data engineering interview questions?
A representative 20-problem company set is often tiered as 6 easy, 11 medium, and 3 hard—medium SQL and join work is the spine; hard items may be data warehouse design or non-trivial graph or DP-style Python depending on the pack. Calibrate with easy vs medium on the same hub.
How should I prepare for an Airbnb data engineering interview?
Solve a mix of company-tagged and global topic problems, verbalize your assumptions, and revisit every miss with a one-paragraph post-mortem (what grain you chose and why). Aim for dozens of reps across SQL and Python before an onsite-style day—not a single cram night.
What programming languages does Airbnb test in data engineering interviews?
SQL and Python are the main language filters in many DE loops; some teams add verbal modeling or system design without a new language. For reps that match, start from Airbnb SQL and Airbnb Python on PipeCode.
Where can I practice Airbnb-style data engineering questions with tests and feedback?
PipeCode offers 450+ problems with automated tests and feedback; the company page for Airbnb is /explore/practice/company/airbnb. You can also browse all company hubs to compare volume and topics across employers.
Start practicing Airbnb data engineering problems