Atlassian data engineering interview questions are the most SQL-window-function-heavy loop covered so far. Six of the curated eight problems are SQL—and four of those six lean directly on OVER (...) clauses for ranking, gaps-and-islands runs, frame-based moving averages, and continuous time-series. Expect aggregation for duplicate detection, daily ranking with ROW_NUMBER / RANK, subscription expiry math with date intervals, gaps-and-islands for consecutive runs, 3-day moving averages with ROWS BETWEEN, and continuous-day visitor counts via calendar tables. The two Python problems are classic complements: balanced parentheses with a stack, and first/last position via binary search.
This guide walks through the eight topic clusters Atlassian actually tests, each with a detailed topic explanation, per-sub-topic explanations with worked examples and solutions, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 8-problem Atlassian set (2 easy, 4 medium, 2 hard); both Hard problems are SQL window-function time-series, signaling that window-function depth is the make-or-break skill.
Top Atlassian data engineering interview topics
From the Atlassian data engineering practice set, the eight numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–8) | Why it shows up at Atlassian |
|---|---|---|
| 1 | SQL aggregation for duplicate detection | Duplicate Employee ID Detection—GROUP BY ... HAVING COUNT(*) > 1. |
| 2 | SQL window functions for daily ranking | Daily Top Visitor by Page Views—ROW_NUMBER() / RANK() partitioned by day. |
| 3 | SQL date functions for subscription expiry calculation | Subscription Expiry Date Calculation—start_date + INTERVAL with plan-aware durations. |
| 4 | SQL gaps-and-islands for consecutive run detection | Stadium Consecutive High-Attendance Days—ROW_NUMBER difference trick. |
| 5 | SQL window moving averages with frame clauses | 3-Day Moving Average of Pages Viewed—ROWS BETWEEN 2 PRECEDING AND CURRENT ROW. |
| 6 | SQL continuous time-series with calendar tables | Continuous Daily Visitors Count—generate_series + LEFT JOIN for fill-zero. |
| 7 | Python stacks for balanced parentheses | Balanced Parentheses—push openers, pop on closers, validate match. |
| 8 | Python binary search for first/last position | First and Last Position of Element in Sorted Array—bisect_left and bisect_right - 1. |
SaaS analytics framing rule: Atlassian's prompts are dressed in B2B SaaS data—page views, subscriptions, daily visitors, attendance counts. The interviewer is grading whether you map the framing to the right SQL primitive: ranking → window with PARTITION BY day; consecutive runs → gaps-and-islands; moving averages → ROWS BETWEEN; continuous days → calendar-table zero-fill. State the mapping out loud before writing code.
1. SQL Aggregation for Duplicate Detection
SQL aggregation for duplicate detection in data engineering
Duplicate detection is the simplest aggregation question and the most mishandled. The clean answer is one GROUP BY: group by the candidate key, count, filter to groups with count > 1—total cost O(N log N) for the sort, O(N) for the sweep. The mental model: aggregation collapses rows by key; HAVING filters the collapsed groups.
Pro tip: Notice whether the prompt wants the duplicate keys (the HAVING query directly) or all duplicate rows (a join back or IN (SELECT ...)). State the distinction; interviewers grade whether you noticed.
GROUP BY + HAVING COUNT(*) > 1
The canonical pattern: group by the candidate key, count, filter with HAVING. HAVING runs after GROUP BY and aggregates—filtering aggregates in WHERE is invalid because the aggregate isn't computed yet.
- GROUP BY key — collapses to one row per unique key.
- HAVING COUNT(*) > 1 — keeps only keys that appear more than once.
- Evaluation order: FROM → WHERE → GROUP BY → aggregates → HAVING → SELECT → ORDER BY.
- Worked example: employee_id values [100, 101, 100, 102, 101] → GROUP BY returns counts {100: 2, 101: 2, 102: 1}; HAVING COUNT(*) > 1 keeps {100, 101}.
SELECT employee_id, COUNT(*) AS dup_count
FROM employees
GROUP BY employee_id
HAVING COUNT(*) > 1
ORDER BY employee_id;
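The query above can be exercised end to end with Python's sqlite3 module. A minimal sketch; the in-memory table and sample ids mirror the worked example with employee_ids [100, 101, 100, 102, 101]:

```python
import sqlite3

# Build the worked example in memory and run the GROUP BY + HAVING query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees(employee_id INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?)",
                 [(100,), (101,), (100,), (102,), (101,)])
rows = conn.execute("""
    SELECT employee_id, COUNT(*) AS dup_count
    FROM employees
    GROUP BY employee_id
    HAVING COUNT(*) > 1
    ORDER BY employee_id
""").fetchall()
print(rows)  # [(100, 2), (101, 2)] -- 102 appears once and is filtered out
```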
COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)
Three counting variants, three semantics. Pick by what the prompt asks for.
- COUNT(*) — counts rows in the group, regardless of NULLs.
- COUNT(col) — counts rows where col is not NULL.
- COUNT(DISTINCT col) — counts unique non-null values; more expensive because the planner must dedupe first.
- Worked example: salaries [100, NULL, 200, 100] in one group → COUNT(*) = 4, COUNT(salary) = 3, COUNT(DISTINCT salary) = 2.
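The three counting variants can be verified in one line of SQL via sqlite3 (a sketch on the worked-example salaries, including the NULL):

```python
import sqlite3

# One group, four salary rows (one NULL): compare the three COUNT variants.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(salary INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [(100,), (None,), (200,), (100,)])
row = conn.execute(
    "SELECT COUNT(*), COUNT(salary), COUNT(DISTINCT salary) FROM t"
).fetchone()
print(row)  # (4, 3, 2) -- all rows, non-NULL rows, unique non-NULL values
```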
Returning all duplicate rows via IN (SELECT ...)
The HAVING query returns duplicate keys. To return the original rows (so you can see what's different between them), filter with IN (SELECT ...) or use a window-function COUNT() OVER.
- IN (SELECT ...) — two passes: the inner query produces duplicate keys, the outer scans and filters. Cost O(N log N) + O(N).
- COUNT(*) OVER (PARTITION BY key) — one pass; usually faster on large tables.
SELECT employee_id, name
FROM (
SELECT *, COUNT(*) OVER (PARTITION BY employee_id) AS cnt
FROM employees
) t
WHERE cnt > 1;
Common beginner mistakes
- Using COUNT(col) when COUNT(*) is correct—silently miscounts rows where col is NULL.
- Filtering COUNT(*) > 1 in WHERE instead of HAVING—WHERE runs before aggregation; the query is invalid.
- Forgetting that IN (NULL, ...) produces UNKNOWN and drops everything—use NOT EXISTS for negated set checks.
- Using a self-join (employees a JOIN employees b ON a.employee_id = b.employee_id AND a.id != b.id)—works but is O(N²) in the worst case.
- Returning duplicate keys when the prompt wanted full duplicate rows.
SQL interview question on duplicate detection
Table employees(employee_id, name, dept_id). Return all employee_ids that appear more than once, alongside the count of duplicates. Sort by dup_count descending, then employee_id ascending.
Solution using GROUP BY + HAVING
SELECT employee_id, COUNT(*) AS dup_count
FROM employees
GROUP BY employee_id
HAVING COUNT(*) > 1
ORDER BY dup_count DESC, employee_id ASC;
Why this works: GROUP BY employee_id collapses the table to one row per unique id; COUNT(*) is the per-id row count; HAVING COUNT(*) > 1 filters to ids with at least one duplicate. The ORDER BY clause uses an aggregate-derived alias, valid in standard SQL because the alias is in scope after SELECT. Total cost: one scan of employees, hash or sort-based aggregation, filter, sort. O(N log N) worst case, O(N) with a covering index on employee_id.
2. SQL Window Functions for Daily Ranking
SQL window-function ranking in data engineering
Daily-top ranking is the textbook SaaS window-function pattern: for each day, return the visitor with the highest page-view count. The pipeline: PARTITION BY day resets the ranking at each day boundary; ORDER BY metric DESC sorts within each day; keeping the rows where ROW_NUMBER() = 1 selects the top per day.
Pro tip: Windows can't be filtered in WHERE—the evaluation order is FROM → WHERE → GROUP BY → HAVING → window → SELECT. Wrap the window in a CTE, then filter the CTE outside. This is the most common window-function bug.
ROW_NUMBER / RANK / DENSE_RANK semantics
Three ranking functions because three different "rank" semantics show up in real prompts. Pick by the prompt's wording and state the choice out loud.
- ROW_NUMBER() — unique sequential integer; ties broken arbitrarily. Add a tie-break column (e.g. ORDER BY page_views DESC, visitor_id ASC) for determinism.
- RANK() — Olympic medal logic. Tied rows share a rank; the next rank skips (1, 1, 3).
- DENSE_RANK() — RANK without the gap (1, 1, 2). Use when the rank itself is a label or partition key.
- Worked example: page views [200, 150, 150, 100] ranked descending:
| page_views | row_number | rank | dense_rank |
|---|---|---|---|
| 200 | 1 | 1 | 1 |
| 150 | 2 | 2 | 2 |
| 150 | 3 | 2 | 2 |
| 100 | 4 | 4 | 3 |
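The table above can be reproduced with sqlite3, which supports all three ranking functions (a sketch; assumes a Python build with SQLite 3.25+ for window-function support):

```python
import sqlite3

# Rank [200, 150, 150, 100] descending with all three ranking functions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE v(page_views INTEGER)")
conn.executemany("INSERT INTO v VALUES (?)", [(200,), (150,), (150,), (100,)])
rows = conn.execute("""
    SELECT page_views,
           ROW_NUMBER() OVER (ORDER BY page_views DESC) AS rn,
           RANK()       OVER (ORDER BY page_views DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY page_views DESC) AS drnk
    FROM v
    ORDER BY page_views DESC
""").fetchall()
# Ties at 150 share rank 2; RANK skips to 4, DENSE_RANK continues at 3.
print(rows)
```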
PARTITION BY day ORDER BY metric DESC
PARTITION BY buckets rows into independent groups; the window resets at each partition. ORDER BY page_views DESC sorts within each partition; ROW_NUMBER() = 1 is the per-day winner. The pattern is composable—change the partition key (week, month, region) without changing the shape.
- Tie-break: add a secondary key (ORDER BY page_views DESC, visitor_id ASC) for deterministic results.
- Worked example: two visitors over two days—V1 wins 04-28 (100 views), V2 wins 04-29 (80 views).
SELECT
day, visitor_id, page_views,
ROW_NUMBER() OVER (PARTITION BY day ORDER BY page_views DESC, visitor_id ASC) AS rn
FROM visits;
Filtering on a window column via CTE
Standard SQL evaluates WHERE before windows, so WHERE rn = 1 directly on the windowed column is invalid. The fix is structural: wrap the window in a CTE, then filter outside where rn is a real column.
- CTE wrapper — portable across PostgreSQL, MySQL 8+, and standard SQL.
- QUALIFY ROW_NUMBER() OVER (...) = 1 — Snowflake / BigQuery / Teradata only; cleaner but not portable.
- Aggregate-and-self-join — always works; always slower.
- Worked example: the CTE-wrapped query on the visits data above returns one row per day: (04-28, V1, 100) and (04-29, V2, 80).
WITH ranked AS (
SELECT
day, visitor_id, page_views,
ROW_NUMBER() OVER (PARTITION BY day ORDER BY page_views DESC, visitor_id ASC) AS rn
FROM visits
)
SELECT day, visitor_id, page_views FROM ranked WHERE rn = 1;
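The CTE-wrapped query runs as written on sqlite3 (a sketch; the sample rows are illustrative and match the two-visitor worked example, assuming SQLite 3.25+):

```python
import sqlite3

# Two visitors over two days: V1 wins 04-28, V2 wins 04-29.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits(visitor_id TEXT, day TEXT, page_views INTEGER);
INSERT INTO visits VALUES
 ('V1','2026-04-28',100),('V2','2026-04-28',80),
 ('V1','2026-04-29',60),('V2','2026-04-29',80);
""")
rows = conn.execute("""
WITH ranked AS (
  SELECT day, visitor_id, page_views,
         ROW_NUMBER() OVER (
           PARTITION BY day
           ORDER BY page_views DESC, visitor_id ASC
         ) AS rn
  FROM visits
)
SELECT day, visitor_id, page_views FROM ranked WHERE rn = 1 ORDER BY day
""").fetchall()
print(rows)  # [('2026-04-28', 'V1', 100), ('2026-04-29', 'V2', 80)]
```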
Common beginner mistakes
- Writing WHERE rn = 1 directly on a window column—evaluation order makes it invalid.
- Using RANK when the prompt requires DENSE_RANK or ROW_NUMBER—mismatched semantics.
- Forgetting PARTITION BY and ranking globally instead of per-day.
- Forgetting the tie-break secondary key—non-deterministic results.
- Filtering window output without a CTE wrapper.
SQL interview question on daily ranking
Table visits(visitor_id, day, page_views). For each day, return the top-1 visitor by page views (ties broken by visitor_id ascending). Use one CTE.
Solution using ROW_NUMBER partitioned by day
WITH ranked AS (
SELECT
day, visitor_id, page_views,
ROW_NUMBER() OVER (
PARTITION BY day
ORDER BY page_views DESC, visitor_id ASC
) AS rn
FROM visits
)
SELECT day, visitor_id, page_views
FROM ranked
WHERE rn = 1
ORDER BY day;
Why this works: PARTITION BY day resets the ranking at each day boundary, so the window operates per-day. ORDER BY page_views DESC, visitor_id ASC ranks visitors top-first with deterministic tie-break. ROW_NUMBER() = 1 keeps exactly one row per day (vs RANK = 1 which would keep all tied winners). The CTE wrapper is required because windows can't be filtered in WHERE. The outer ORDER BY day produces deterministic output for downstream consumers.
3. SQL Date Functions for Subscription Expiry Calculation
SQL date arithmetic in data engineering
Subscription accounting is bread-and-butter B2B SaaS analytics. Given a start_date and a plan with a known duration, compute expiry_date. The clean answer uses interval arithmetic: start_date + INTERVAL '1 month'. Dates support arithmetic just like numbers—but with calendar-aware semantics: clamping for invalid dates, and dialect-specific behavior worth naming aloud.
Pro tip: Real subscription systems have monthly, annual, biennial, lifetime plans. Switch the interval with a CASE—or, better, store the duration as a column on the plan and join it in.
start_date + INTERVAL
PostgreSQL's interval arithmetic on dates and timestamps reads mathematically: take the start date, add the interval. The result type matches the input.
- Common units — '1 day', '1 week', '1 month', '3 months', '1 year', '1 year 6 months'.
- Month clamping — 2026-01-31 + INTERVAL '1 month' = 2026-02-28 (February has 28 days; the result is clamped to the last valid day).
- Leap-day clamping — 2024-02-29 + INTERVAL '1 year' = 2025-02-28.
- Worked example: start_date = 2026-01-15 + INTERVAL '1 month' → 2026-02-15; start_date = 2025-04-28 + INTERVAL '1 year' → 2026-04-28.
SELECT
subscription_id,
start_date,
start_date + INTERVAL '1 month' AS monthly_expiry,
start_date + INTERVAL '1 year' AS annual_expiry
FROM subscriptions;
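PostgreSQL does the month clamping natively. To make the semantics concrete, here is a pure-Python sketch of the same rule; add_months_clamped is a hypothetical helper name, not a library function:

```python
import calendar
from datetime import date

def add_months_clamped(d: date, months: int) -> date:
    # Mimics PostgreSQL's date + INTERVAL 'n months': the day-of-month
    # is clamped to the last valid day of the target month.
    y, m = divmod(d.year * 12 + (d.month - 1) + months, 12)
    last_day = calendar.monthrange(y, m + 1)[1]
    return date(y, m + 1, min(d.day, last_day))

print(add_months_clamped(date(2026, 1, 31), 1))    # 2026-02-28 (clamped)
print(add_months_clamped(date(2026, 1, 15), 1))    # 2026-02-15
print(add_months_clamped(date(2024, 2, 29), 12))   # 2025-02-28 (leap clamp)
```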
EXTRACT / DATE_PART for components
EXTRACT(field FROM source) and the equivalent DATE_PART(field, source) return a numeric component of a date or timestamp.
- Common fields — year, month, day, dow (day of week, 0 = Sunday), doy, quarter, hour, epoch.
- Use cases — EXTRACT(year FROM start_date) for per-year aggregation; EXTRACT(dow FROM ts) for weekend filters; EXTRACT(epoch FROM (end_ts - start_ts)) for duration in seconds.
- Worked example: for 2026-04-28 14:30:00, EXTRACT(year) = 2026, EXTRACT(month) = 4, EXTRACT(dow) = 2 (Tuesday), EXTRACT(hour) = 14.
Plan-aware CASE for variable-duration plans
Real subscription systems have multiple plan durations. The cleanest in-line SQL is a CASE expression that returns the appropriate interval per row.
- ELSE NULL — defends against unknown plan values; the row gets a NULL expiry instead of crashing.
- Production-grade alternative — store durations on a plans dimension and join: adding a new plan becomes a row, not a code change.
- Worked example: start_date = 2026-04-28, plan 'annual' → expiry_date = 2027-04-28.
SELECT
subscription_id,
start_date,
start_date + (
CASE plan
WHEN 'monthly' THEN INTERVAL '1 month'
WHEN 'annual' THEN INTERVAL '1 year'
WHEN 'biennial' THEN INTERVAL '2 years'
ELSE NULL
END
) AS expiry_date
FROM subscriptions;
Common beginner mistakes
- Hard-coding start_date + 30 (integer days) and missing month/year clamping.
- Using BETWEEN for date-range filters and over-including the upper boundary on timestamp columns.
- Forgetting that EXTRACT(dow ...) is 0-indexed (Sunday = 0)—off-by-one in weekend filters.
- Building expiry math with EXTRACT + CONCAT instead of just adding an INTERVAL.
- Ignoring leap-day clamping: 2024-02-29 + INTERVAL '1 year' yields 2025-02-28, not the (invalid) 2025-02-29.
SQL interview question on subscription expiry
Table subscriptions(subscription_id, start_date, plan) with plan ∈ {'monthly', 'annual', 'biennial'}. Return each subscription with a computed expiry_date. Unknown plans should yield NULL expiry.
Solution using INTERVAL arithmetic and CASE
SELECT
subscription_id,
start_date,
start_date + (
CASE plan
WHEN 'monthly' THEN INTERVAL '1 month'
WHEN 'annual' THEN INTERVAL '1 year'
WHEN 'biennial' THEN INTERVAL '2 years'
ELSE NULL
END
)::date AS expiry_date
FROM subscriptions
ORDER BY subscription_id;
Why this works: The CASE returns the appropriate INTERVAL per row based on plan; adding it to start_date produces the expiry. The ::date cast keeps the result a date (otherwise interval arithmetic on a date returns a timestamp). The ELSE NULL branch catches unknown plans and produces NULL—the query never crashes on bad data, and downstream consumers can filter or alert on NULL expiries. Per-row cost is constant; total is O(N) for the table scan.
4. SQL Gaps-and-Islands for Consecutive Run Detection
Gaps-and-islands in SQL for data engineering
"Find consecutive runs of rows where some condition holds" is the gaps-and-islands family. The trick: filter to qualifying rows, assign ROW_NUMBER() OVER (ORDER BY date), then date - row_number is constant within each run and different across runs. Group by that constant and you have your islands.
The ROW_NUMBER difference trick
For consecutive dates, each step in date and each step in rn increments by 1, so their difference is invariant. A gap (a non-qualifying day) breaks the invariant—the next qualifying date jumps ahead while rn only advances by 1, so date - rn changes value.
- Filter first — WHERE attendance >= threshold before the ROW_NUMBER; the trick only works on qualifying rows.
- Use ROW_NUMBER, not RANK — duplicates in the order key would tie ranks and break the difference.
- Worked example: 5 high-attendance days [04-01, 04-02, 04-03, 04-05, 04-06]:
| date | row_number | date - rn (days) | island |
|---|---|---|---|
| 04-01 | 1 | 03-31 | A |
| 04-02 | 2 | 03-31 | A |
| 04-03 | 3 | 03-31 | A |
| 04-05 | 4 | 04-01 | B |
| 04-06 | 5 | 04-01 | B |
Identifying islands by group key (date - rn)
Once you have the group_key column, group by it: each group is one island. Per-island aggregates produce the run summary.
- MIN(date) — run start.
- MAX(date) — run end.
- COUNT(*) — run length.
- AVG(metric) / SUM(metric) — cumulative metrics across the run.
- Note: group_key is meaningless after the fact (it's a date offset); discard it.
WITH labeled AS (
SELECT date, attendance,
date - INTERVAL '1 day' * ROW_NUMBER() OVER (ORDER BY date) AS group_key
FROM stadium_attendance
WHERE attendance >= 50000
)
SELECT
MIN(date) AS run_start,
MAX(date) AS run_end,
COUNT(*) AS run_length,
AVG(attendance) AS avg_attendance
FROM labeled
GROUP BY group_key;
Filtering islands of length ≥ N
"Find runs of at least 3 consecutive days" is a HAVING COUNT(*) >= N filter on the GROUP BY query. Same evaluation-order rule as §1: WHERE runs before aggregation, so length filters belong in HAVING.
- Worked example: with threshold 3 on the table above, only the 04-01..04-03 run (length 3) passes; the 04-05..04-06 run (length 2) drops out.
WITH labeled AS (
SELECT date, attendance,
date - INTERVAL '1 day' * ROW_NUMBER() OVER (ORDER BY date) AS group_key
FROM stadium_attendance
WHERE attendance >= 50000
)
SELECT MIN(date) AS run_start, MAX(date) AS run_end, COUNT(*) AS run_length
FROM labeled
GROUP BY group_key
HAVING COUNT(*) >= 3
ORDER BY run_start;
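The length-filter query can be checked with sqlite3. SQLite has no INTERVAL type, so this sketch swaps in julianday(day) - rn as the group key—the same invariant, expressed numerically (sample attendance figures are illustrative; assumes SQLite 3.25+):

```python
import sqlite3

# Two high-attendance runs: 04-01..04-03 (length 3) and 04-05..04-06 (length 2).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stadium_attendance(day TEXT, attendance INTEGER);
INSERT INTO stadium_attendance VALUES
 ('2026-04-01',60000),('2026-04-02',55000),('2026-04-03',52000),
 ('2026-04-04',10000),('2026-04-05',70000),('2026-04-06',65000);
""")
rows = conn.execute("""
WITH labeled AS (
  SELECT day,
         julianday(day) - ROW_NUMBER() OVER (ORDER BY day) AS group_key
  FROM stadium_attendance
  WHERE attendance >= 50000
)
SELECT MIN(day) AS run_start, MAX(day) AS run_end, COUNT(*) AS run_length
FROM labeled
GROUP BY group_key
HAVING COUNT(*) >= 3
ORDER BY run_start
""").fetchall()
print(rows)  # [('2026-04-01', '2026-04-03', 3)] -- the length-2 run drops out
```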
Common beginner mistakes
- Forgetting to filter with WHERE attendance >= threshold before the ROW_NUMBER—the trick only works on the qualifying rows.
- Using RANK() instead of ROW_NUMBER()—if the order key has duplicates, ranks tie and the difference trick breaks.
- Forgetting to multiply by INTERVAL '1 day' on timestamp columns—timestamp - integer is invalid in PostgreSQL (plain date - integer does work, returning a date).
- Filtering islands of length ≥ N with WHERE COUNT(*) >= N—COUNT(*) isn't computed yet in WHERE.
- Trying to recover the group_key after the fact—it's an opaque internal value, not a meaningful date.
SQL interview question on gaps-and-islands
Table stadium_attendance(date, attendance). Return each maximal run of consecutive days where attendance >= 50000, with columns run_start, run_end, run_length. Sort by run_start.
Solution using the ROW_NUMBER difference trick
WITH high_days AS (
SELECT date, attendance
FROM stadium_attendance
WHERE attendance >= 50000
),
labeled AS (
SELECT
date, attendance,
date - INTERVAL '1 day' * ROW_NUMBER() OVER (ORDER BY date) AS group_key
FROM high_days
)
SELECT
MIN(date) AS run_start,
MAX(date) AS run_end,
COUNT(*) AS run_length
FROM labeled
GROUP BY group_key
ORDER BY run_start;
Why this works: The first CTE filters to qualifying days. The second CTE assigns a row_number() to each qualifying day in date order, then computes date - rn as a group key. Because consecutive days increment date and rn by exactly 1 each, their difference is constant within a run and changes across runs. GROUP BY group_key collapses each run; MIN(date), MAX(date), COUNT(*) produce the run's start, end, and length. The total cost is O(N log N) for the row-number sort + O(N) for the group-by.
5. SQL Window Moving Averages with Frame Clauses
Moving averages and frame clauses in SQL for data engineering
A moving average smooths noisy time-series: the 3-day average of daily page views is less spiky than raw values. The clean answer uses frame clauses in the window: AVG(page_views) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW). The frame defines which rows the aggregate sees per row.
Pro tip: Without an explicit frame, an AVG() OVER (ORDER BY ...) defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW—a running average from the start, not a moving window. Always specify the frame explicitly when correctness matters.
ROWS BETWEEN N PRECEDING AND CURRENT ROW
The frame clause has the form ROWS BETWEEN <start> AND <end>, where each endpoint is UNBOUNDED PRECEDING, N PRECEDING, CURRENT ROW, N FOLLOWING, or UNBOUNDED FOLLOWING.
- Running total — ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Aggregate from start to here.
- 3-row trailing window — ROWS BETWEEN 2 PRECEDING AND CURRENT ROW. Current row plus the two before.
- Centered window — ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING. Symmetric around the current row.
- Boundary behavior — at the start of the partition, the frame shrinks: the first day's "3-day moving average" is just the first day's value.
- Worked example: daily page views and 3-row trailing average:
| day | page_views | 3-day moving avg |
|---|---|---|
| 04-01 | 100 | 100.00 (1 row) |
| 04-02 | 120 | 110.00 (avg of 100, 120) |
| 04-03 | 80 | 100.00 (avg of 100, 120, 80) |
| 04-04 | 150 | 116.67 (avg of 120, 80, 150) |
| 04-05 | 200 | 143.33 (avg of 80, 150, 200) |
SELECT
day, page_views,
AVG(page_views) OVER (
ORDER BY day
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_3d
FROM page_views_daily;
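The worked table above can be reproduced with sqlite3, which supports ROWS BETWEEN frames (a sketch; assumes SQLite 3.25+):

```python
import sqlite3

# 3-day trailing moving average over the worked-example series.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views_daily(day TEXT, page_views INTEGER)")
conn.executemany(
    "INSERT INTO page_views_daily VALUES (?, ?)",
    [("2026-04-01", 100), ("2026-04-02", 120), ("2026-04-03", 80),
     ("2026-04-04", 150), ("2026-04-05", 200)],
)
rows = conn.execute("""
    SELECT day,
           ROUND(AVG(page_views) OVER (
               ORDER BY day
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ), 2) AS moving_avg_3d
    FROM page_views_daily
    ORDER BY day
""").fetchall()
# The frame shrinks at the start: 100.0, 110.0, 100.0, 116.67, 143.33.
print([r[1] for r in rows])
```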
ROWS vs RANGE for time-series with duplicates
Two frame modes exist; they're identical when the order key has unique values and diverge when there are duplicates.
- ROWS BETWEEN ... AND ... — counts physical rows. 2 PRECEDING = exactly 2 rows back, regardless of value.
- RANGE BETWEEN ... AND ... — counts based on the order-key value. RANGE BETWEEN INTERVAL '2 days' PRECEDING = all rows within 2 days of the current key.
- Default when ORDER BY is present and no frame is given is RANGE—surprises candidates who expected ROWS.
- Worked example: with duplicates at 09:00:00 (values 10 and 20) and one row at 09:00:01 (value 5), ROWS BETWEEN 2 PRECEDING always sees ≤ 3 physical rows; RANGE BETWEEN INTERVAL '1 second' PRECEDING sees all rows within the last second (could be many more).
Centered moving averages with 1 PRECEDING AND 1 FOLLOWING
Trailing windows lag the underlying signal because older data dominates. Centered windows (1 PRECEDING AND 1 FOLLOWING) are symmetric around the current row—better for historical smoothing. Trade-off: they need future data, so they aren't usable for real-time forecasting.
- Worked example: on the same daily series above, the centered 3-day average for 04-03 is (120 + 80 + 150) / 3 = 116.67; the trailing version was 100.00. Frames shrink at both ends.
SELECT
day, page_views,
AVG(page_views) OVER (
ORDER BY day
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
) AS centered_avg
FROM page_views_daily;
Common beginner mistakes
- Forgetting to specify the frame and getting the implicit RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (running total) instead of a moving window.
- Using ROWS when RANGE is correct (or vice versa) on time-series with duplicates.
- Forgetting that the frame shrinks at partition boundaries—first and last rows have shorter effective windows.
- Trying to filter by moving_avg_3d > X in WHERE—same window-function-in-WHERE bug as §2; wrap in a CTE.
- Forgetting ORDER BY inside OVER ()—the window function then uses arbitrary order, producing nonsense.
SQL interview question on moving averages
Table daily_pageviews(day, page_views). Return each day with its 3-day trailing moving average of page_views. Sort by day.
Solution using a window function with ROWS BETWEEN
SELECT
day,
page_views,
AVG(page_views) OVER (
ORDER BY day
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_3d
FROM daily_pageviews
ORDER BY day;
Why this works: The window's ORDER BY day defines the row sequence; ROWS BETWEEN 2 PRECEDING AND CURRENT ROW defines a 3-row trailing window per row. AVG(page_views) over that window is the moving average. At the start of the series, the window shrinks (first day has 1 row, second day has 2)—this is the expected behavior for trailing windows. Total cost: O(N log N) for the sort + O(N) for the streaming aggregate.
6. SQL Continuous Time-Series with Calendar Tables
Continuous time-series with calendar tables in SQL for data engineering
Real time-series have gaps—days where no events happened. The fix is a calendar table (or generate_series in PostgreSQL): a synthetic source of consecutive dates, left-joined to the fact table. Days with no fact rows get NULL on the right side; COALESCE fills zero. Without it, charts and rolling-window calculations skip the gap days and produce wrong totals.
Generating a date series with generate_series
PostgreSQL's generate_series(start, end, step) produces a synthetic table of values—for dates, all calendar days in the range as a single column.
- Static range — generate_series('2026-04-01'::date, '2026-04-30'::date, INTERVAL '1 day').
- Dynamic range — parameterize with (SELECT MIN(day) FROM events) so the series adapts to the data.
- Cast — append ::date if you want date instead of timestamp.
- Worked example: generates 5 consecutive days starting 2026-04-28: [04-28, 04-29, 04-30, 05-01, 05-02].
SELECT generate_series(
'2026-04-28'::date,
'2026-05-02'::date,
INTERVAL '1 day'
)::date AS day;
LEFT JOIN calendar to fact table for fill-zero
Combine generate_series with LEFT JOIN to the fact table to fill missing days with zero. This is the SQL equivalent of pandas's df.reindex(date_range, fill_value=0).
- LEFT JOIN — keeps every calendar row regardless of whether the fact has it.
- COALESCE(metric, 0) — converts the NULL on missing days to zero.
- Worked example: facts only have rows for 04-28 (10 visitors) and 04-30 (5 visitors). After the calendar join: 04-28: 10, 04-29: 0 (filled), 04-30: 5, 05-01: 0 (filled), 05-02: 0 (filled).
WITH calendar AS (
SELECT generate_series('2026-04-01'::date, '2026-04-30'::date, INTERVAL '1 day')::date AS day
),
visits AS (
SELECT day, COUNT(DISTINCT visitor_id) AS visitor_count
FROM page_visits
WHERE day BETWEEN '2026-04-01' AND '2026-04-30'
GROUP BY day
)
SELECT
c.day,
COALESCE(v.visitor_count, 0) AS visitors
FROM calendar c
LEFT JOIN visits v ON v.day = c.day
ORDER BY c.day;
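The fill-zero pattern can be checked with sqlite3. Python's bundled SQLite doesn't ship generate_series, so this sketch swaps in a recursive CTE as the calendar source (sample visitor rows are illustrative, not the worked example's counts):

```python
import sqlite3

# Facts exist only for 04-28 (2 distinct visitors) and 04-30 (1 visitor);
# the calendar LEFT JOIN fills the other days with zero.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page_visits(day TEXT, visitor_id TEXT);
INSERT INTO page_visits VALUES
 ('2026-04-28','a'),('2026-04-28','b'),('2026-04-30','c');
""")
rows = conn.execute("""
WITH RECURSIVE calendar(day) AS (
  SELECT '2026-04-28'
  UNION ALL
  SELECT date(day, '+1 day') FROM calendar WHERE day < '2026-05-02'
),
visits AS (
  SELECT day, COUNT(DISTINCT visitor_id) AS visitor_count
  FROM page_visits
  GROUP BY day
)
SELECT c.day, COALESCE(v.visitor_count, 0) AS visitors
FROM calendar c
LEFT JOIN visits v ON v.day = c.day
ORDER BY c.day
""").fetchall()
print(rows)  # 04-29, 05-01, 05-02 appear with 0 instead of being missing
```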
Continuous visitor count: LEAD/LAG to detect breaks
Once the calendar join produces a gap-free series, LAG(visitors) gives yesterday's count and LEAD(visitors) gives tomorrow's. Combine with the §4 gaps-and-islands trick to find the longest run of consecutive days with at least one visitor.
- Composition — calendar table fills gaps; gaps-and-islands then identifies runs in the gap-free series.
- Worked example: with the calendar-filled data above, days where visitors > 0 are [04-28, 04-30]—non-consecutive (because 04-29 had zero). Longest run length = 1.
-- calendar and visits are the CTEs defined in the previous query
WITH dense AS (
SELECT c.day, COALESCE(v.visitor_count, 0) AS visitors
FROM calendar c LEFT JOIN visits v USING (day)
),
labeled AS (
SELECT day,
day - INTERVAL '1 day' * ROW_NUMBER() OVER (ORDER BY day) AS group_key
FROM dense
WHERE visitors > 0
)
SELECT MIN(day) AS run_start, MAX(day) AS run_end, COUNT(*) AS run_length
FROM labeled
GROUP BY group_key
ORDER BY run_length DESC
LIMIT 1;
Common beginner mistakes
- Using INNER JOIN instead of LEFT JOIN—drops the calendar rows that have no fact match (the whole point is to keep them).
- Forgetting COALESCE(metric, 0) and getting NULL in the output instead of zero.
- Hardcoding the calendar bounds and missing days outside that range.
- Generating the calendar inside a subquery without indexing—works but slow on large date ranges.
- Reaching for generate_series when a real dim_calendar table exists in the warehouse—use the existing table; it's pre-indexed and includes business attributes (holidays, fiscal weeks).
SQL interview question on continuous time-series
Table daily_visitors(day, visitor_count) has rows only for days with visitors. Return a gap-free time series for the date range [start, end] with zero-fill for missing days, plus a 7-day trailing sum of visitors.
Solution using generate_series + LEFT JOIN + window-function rolling sum
WITH calendar AS (
SELECT generate_series('2026-04-01'::date, '2026-04-30'::date, INTERVAL '1 day')::date AS day
),
filled AS (
SELECT
c.day,
COALESCE(v.visitor_count, 0) AS visitors
FROM calendar c
LEFT JOIN daily_visitors v ON v.day = c.day
)
SELECT
day,
visitors,
SUM(visitors) OVER (
ORDER BY day
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_sum
FROM filled
ORDER BY day;
Why this works: generate_series produces a complete date sequence regardless of fact-table gaps. LEFT JOIN attaches actual visitor counts, with NULL on missing days; COALESCE(... , 0) fills NULLs with zero. The final SUM() OVER (... ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) is a 7-row trailing window aggregate, computed correctly on the gap-free series. If we'd skipped the calendar join, the 7-day rolling sum would have been computed over fewer rows (skipping zero-visitor days), producing wrong totals.
7. Python Stacks for Balanced Parentheses
Stacks for matching brackets in Python for data engineering
Balanced parentheses is the textbook stack interview—a candidate who can't write it cleanly in 10 lines will get filtered at the phone screen. Walk the string, push openers ((, [, {) onto a stack, pop on closers (), ], }) and verify the popped opener matches. Empty stack at end-of-string means balanced; a non-empty stack means unmatched openers; a mismatched pop means crossed brackets.
Pro tip: A Python list is the right data structure: append to push, pop to pop, both O(1) amortized. For 1M-character input the algorithm runs in 1M ops with at most depth-many slots in the stack.
Stack basics: list.append / list.pop for push/pop
Python's list works as a LIFO stack at the right end. The four idioms cover everything you need.
- stack.append(x) — push, O(1) amortized.
- top = stack.pop() — pop and use, O(1).
- if not stack: ... — empty check.
- top = stack[-1] — peek without popping.
- Avoid list.pop(0) — that's O(n); for stacks use the right end only.
- Worked example: push (, [, { then pop twice → the stack ends as ['('], with the pops returning '{' then '['.
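The four idioms, applied to the worked example in a few lines:

```python
# Push (, [, { onto a list-backed stack, then pop twice.
stack: list[str] = []
stack.append("(")        # push, O(1) amortized
stack.append("[")
stack.append("{")
top = stack[-1]          # peek without popping -> "{"
first = stack.pop()      # pop -> "{"
second = stack.pop()     # pop -> "["
print(stack)             # ['('] : one unmatched opener remains
```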
Push openers, pop on closers, validate match
The full algorithm is a single pass over the string with a pairs map from closer to opener.
- Pairs map — {")": "(", "]": "[", "}": "{"}.
- For each char: push if opener; if closer, pop and verify match (return False on empty stack or mismatch).
- End-of-string: non-empty stack → False; empty → True.
- Complexity — O(N) time, O(N) worst-case space (deeply nested input like (((...)))).
- Worked example: trace of s = "({[]})":
| char | action | stack after |
|---|---|---|
| ( | push | ['('] |
| { | push | ['(', '{'] |
| [ | push | ['(', '{', '['] |
| ] | pop, [ == [ ✓ | ['(', '{'] |
| } | pop, { == { ✓ | ['('] |
| ) | pop, ( == ( ✓ | [] |
End: stack empty → balanced.
def is_balanced(s: str) -> bool:
pairs = {")": "(", "]": "[", "}": "{"}
stack: list[str] = []
for ch in s:
if ch in "({[":
stack.append(ch)
elif ch in ")}]":
if not stack or stack.pop() != pairs[ch]:
return False
return not stack
Generalizing to multiple bracket types via dict
The pairs dict is the key to generalization—adding new bracket types is a one-line dict change, the algorithm doesn't.
- Add a pair — pairs["⟧"] = "⟦" extends to math notation; same loop.
- Tags / richer formats — combine a stack with a tokenizer; pure stack-based parsing handles only the simple bracket case.
- Cleaner shape — derive openers = set(pairs.values()) so the loop tests against the dict, not hardcoded strings.
def is_balanced(s: str) -> bool:
pairs = {")": "(", "]": "[", "}": "{"}
openers = set(pairs.values())
stack: list[str] = []
for ch in s:
if ch in openers:
stack.append(ch)
elif ch in pairs:
if not stack or stack.pop() != pairs[ch]:
return False
return not stack
Common beginner mistakes
- Forgetting to check empty stack before popping — `IndexError` on inputs like `)`.
- Using `list.pop(0)` (FIFO) instead of `list.pop()` (LIFO) — wrong stack semantics, also O(n).
- Hardcoding `if ch == "(" and stack[-1] == ")"` instead of using a pairs dict — doesn't generalize.
- Returning `True` mid-loop on a successful match — you must check end-of-string with empty stack.
- Ignoring non-bracket characters (e.g. text inside parens like `"(hello)"`) — need to skip them or handle explicitly.
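To see why the empty-stack guard in the first mistake matters, compare a deliberately buggy pop with the guarded one (a throwaway sketch, not the solution code):

```python
pairs = {")": "(", "]": "[", "}": "{"}

def buggy(s: str) -> bool:
    stack = []
    for ch in s:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            if stack.pop() != pairs[ch]:  # BUG: no empty-stack check
                return False
    return not stack

def guarded(s: str) -> bool:
    stack = []
    for ch in s:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack

try:
    buggy(")")                 # pop on an empty list raises
except IndexError:
    print("buggy: IndexError on a lone closer")
print(guarded(")"))            # False, no crash
```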
Python interview question on balanced parentheses
Implement is_balanced(s) that returns True if all brackets in s ((), [], {}) are properly matched and nested. Non-bracket characters in the string should be ignored. Use O(n) time and O(n) space.
Solution using a stack and a pairs dict
```python
def is_balanced(s: str) -> bool:
    pairs = {")": "(", "]": "[", "}": "{"}
    openers = set(pairs.values())
    stack: list[str] = []
    for ch in s:
        if ch in openers:
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack
```
Why this works: Each character is examined once (O(n) time). Openers push to the stack; closers pop and verify match against the pairs dict. The empty-stack check before pop catches "closer without opener" cases (e.g. ) standalone). The mismatch check catches crossed brackets (e.g. ([)]). The final return not stack catches unclosed openers (e.g. ((). Non-bracket characters fall through both if branches and are ignored. Memory is O(n) worst case for fully-nested input like (((...))).
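The failure modes named above can be checked directly (the function is repeated here so the snippet runs standalone):

```python
def is_balanced(s: str) -> bool:
    pairs = {")": "(", "]": "[", "}": "{"}
    stack: list[str] = []
    for ch in s:
        if ch in pairs.values():
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack

print(is_balanced(")"))        # closer without opener -> False
print(is_balanced("([)]"))     # crossed brackets -> False
print(is_balanced("(("))       # unclosed openers -> False
print(is_balanced("(hello)"))  # non-bracket chars ignored -> True
```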
8. Python Binary Search for First/Last Position
Binary search for first/last position in Python for data engineering
Given a sorted array with duplicates, return the first and last indices where a target appears. Linear scan is O(n); bisect is O(log n). On a 10M-element array that's roughly 10,000,000 comparisons versus about 24 (log2 of 10M ≈ 23.3), a speedup of over five orders of magnitude. The mental model: `bisect_left` returns the leftmost insertion point; `bisect_right` returns the rightmost. For a present target, `[bisect_left, bisect_right - 1]` is the inclusive index range.
Pro tip: When `bisect_left == bisect_right`, the target is absent—return `[-1, -1]`. This single comparison is cheaper and clearer than separate "is it present" checks.
bisect_left for first occurrence
bisect.bisect_left(arr, x) returns the leftmost insertion point: the first index i such that arr[i] >= x. For a present value, this is the first occurrence; for an absent value, the spot to insert.
- Lower-bound semantics — for duplicates of `x`, points at the leftmost.
- Presence check — `i < len(arr) and arr[i] == x`.
- Worked example: `arr = [1, 2, 2, 2, 3, 5]`, `x = 2`:
| function | result | meaning |
|---|---|---|
| `bisect_left(arr, 2)` | 1 | leftmost 2 is at index 1 |
| `bisect_right(arr, 2)` | 4 | one past the rightmost 2 |
| `bisect_right - 1` | 3 | rightmost 2 |
```python
from bisect import bisect_left

def first_occurrence(arr: list[int], x: int) -> int:
    i = bisect_left(arr, x)
    if i < len(arr) and arr[i] == x:
        return i
    return -1
```
bisect_right - 1 for last occurrence
bisect.bisect_right(arr, x) returns the rightmost insertion point: the first index where arr[i] > x. Subtract 1 for the last occurrence of x itself.
- Upper-bound semantics — for duplicates of `x`, points one past the rightmost.
- Guard against negative index — check `j > 0 and arr[j - 1] == x` before using `j - 1`; otherwise Python silently indexes from the end when `j == 0`.
- Worked example: same array, `x = 2`. `bisect_right(arr, 2) = 4`; last occurrence at index 3; `arr[3] == 2` ✓.
```python
from bisect import bisect_right

def last_occurrence(arr: list[int], x: int) -> int:
    j = bisect_right(arr, x)
    if j > 0 and arr[j - 1] == x:
        return j - 1
    return -1
```
Two-pointer alternative when bisect isn't available
When the interview restricts `bisect`, code the binary-search loop from scratch with the half-open range `lo, hi = 0, len(arr)` and `mid = (lo + hi) // 2`.
- First-occurrence comparison — `if arr[mid] < x: lo = mid + 1; else: hi = mid`.
- Last-occurrence comparison — `if arr[mid] <= x: lo = mid + 1; else: hi = mid`, then `last = lo - 1`.
- Loop condition — `while lo < hi` (not `<=`); termination at `lo == hi`.
- Worked example: `arr = [1, 2, 2, 2, 3, 5]`, `x = 2`, manual first-occurrence:
| lo | hi | mid | arr[mid] | action |
|---|---|---|---|---|
| 0 | 6 | 3 | 2 | not <, hi = 3 |
| 0 | 3 | 1 | 2 | not <, hi = 1 |
| 0 | 1 | 0 | 1 | <, lo = 1 |
| 1 | 1 | — | — | exit |
Result: `lo = 1`. `arr[1] == 2` ✓ → first occurrence is 1.
```python
def first_occurrence_manual(arr: list[int], x: int) -> int:
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(arr) and arr[lo] == x:
        return lo
    return -1
```
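The last-occurrence variant from the bullet list follows the same half-open template; only the comparison flips to `<=` and the answer shifts to `lo - 1` (a sketch mirroring `first_occurrence_manual`):

```python
def last_occurrence_manual(arr: list[int], x: int) -> int:
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] <= x:   # <= instead of <: skip past all equal values
            lo = mid + 1
        else:
            hi = mid
    j = lo - 1              # lo lands one past the last occurrence
    if j >= 0 and arr[j] == x:
        return j
    return -1

# last_occurrence_manual([1, 2, 2, 2, 3, 5], 2) returns 3
```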
Common beginner mistakes
- Using `bisect_left` for last occurrence (gives first), or `bisect_right` for first (gives one past last).
- Forgetting the `j > 0` guard before `arr[j - 1]` — negative-index bug when `x < arr[0]`.
- Using `while lo <= hi` instead of `while lo < hi` — infinite loop when `lo == hi`.
- Computing `mid = (lo + hi) / 2` (true division, gives a float) — use `//` (integer division).
- Forgetting that `bisect` requires sorted input — Python won't warn.
Python interview question on first/last position
Given a sorted array arr and a target value x, return the first and last positions of x in arr as [first, last]. If x is absent, return [-1, -1]. Use O(log n) time.
Solution using bisect_left and bisect_right
```python
from bisect import bisect_left, bisect_right

def first_last(arr: list[int], x: int) -> list[int]:
    lo = bisect_left(arr, x)
    hi = bisect_right(arr, x)
    if lo == hi:
        return [-1, -1]
    return [lo, hi - 1]
```
Why this works: bisect_left returns the leftmost insertion point (first position of x if present). bisect_right returns the rightmost insertion point (one past the last position of x). If x isn't present, the two return the same value (no slot exists for it); we detect this with lo == hi and return [-1, -1]. Otherwise [lo, hi - 1] is the inclusive range of x positions. Total: two O(log n) calls plus a constant comparison—net O(log n) time, O(1) extra space.
Tips to crack Atlassian data engineering interviews
These are habits that move the needle in real Atlassian loops—not a re-statement of the topics above.
SQL window-function preparation
Spend most of your SQL prep on window functions—Atlassian's curated set is window-function-heavy: ranking, gaps-and-islands, frame-based moving averages, and continuous time-series. Memorize the four flavors: ROW_NUMBER/RANK/DENSE_RANK for ranking; ROW_NUMBER difference trick for gaps-and-islands; ROWS BETWEEN N PRECEDING AND CURRENT ROW for moving averages; generate_series + LEFT JOIN + LEAD/LAG for continuous time-series. The window functions, gaps-and-islands, and moving-average topic pages cover the spread.
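The frame clause is pure SQL, but its semantics are easy to sanity-check in Python; a sketch of what `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` computes per row (illustrative only, not the SQL itself):

```python
def moving_avg_3(values: list[float]) -> list[float]:
    # Each output row averages the current value plus up to two preceding rows,
    # mirroring AVG(v) OVER (ORDER BY d ROWS BETWEEN 2 PRECEDING AND CURRENT ROW).
    out = []
    for i in range(len(values)):
        window = values[max(0, i - 2): i + 1]
        out.append(sum(window) / len(window))
    return out

print(moving_avg_3([10, 20, 30, 40]))  # [10.0, 15.0, 20.0, 30.0]
```

Note how the first two rows average a shorter window, exactly as the SQL frame does near a partition's start.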
SQL date function preparation
Drill INTERVAL arithmetic for subscription expiry, DATE_TRUNC for time-bucketing, EXTRACT for component access, and generate_series for calendar tables. Atlassian's prompts dress all of these in B2B SaaS data—subscriptions, page views, daily visitors—but the SQL primitives are the same. The date functions topic page covers the bulk.
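The same interval arithmetic can be prototyped with the stdlib before translating to SQL; the plan names and durations below are hypothetical, not Atlassian's schema:

```python
from datetime import date, timedelta

# Hypothetical plan durations; SQL equivalent: start_date + INTERVAL '30 days'.
PLAN_DAYS = {"monthly": 30, "annual": 365}

def expiry_date(start: date, plan: str) -> date:
    return start + timedelta(days=PLAN_DAYS[plan])

print(expiry_date(date(2024, 1, 1), "monthly"))  # 2024-01-31
```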
Python preparation
Atlassian's Python rounds are minimal: just two patterns in the curated set, the stack pattern (balanced parentheses) and binary search (first/last position). Memorize list.append/list.pop, the pairs-dict idiom, bisect_left/bisect_right, and the manual binary-search template. Drill problems from the stack and binary search topic pages.
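The calls listed above, condensed into a runnable crib sheet (array values reused from the worked example in §8):

```python
from bisect import bisect_left, bisect_right

# Stack idioms
stack: list[str] = []
stack.append("(")               # push
assert stack.pop() == "("       # pop: LIFO, right end only

# Bisect idioms (input must be sorted)
arr = [1, 2, 2, 2, 3, 5]
assert bisect_left(arr, 2) == 1        # first index with arr[i] >= 2
assert bisect_right(arr, 2) == 4       # first index with arr[i] > 2
assert bisect_right(arr, 2) - 1 == 3   # last occurrence of 2
```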
SaaS analytics framing
Atlassian's prompts dress general SQL window functions in B2B SaaS data: subscriptions, page views, daily visitors, attendance counts. The interviewer is grading whether you map the framing to the algorithm correctly. State the mapping out loud: "this is daily ranking, use ROW_NUMBER partitioned by day"; "this is consecutive runs, use the ROW_NUMBER difference trick"; "this is a moving average, use ROWS BETWEEN"; "this is gap-filling, use generate_series + LEFT JOIN."
Where to practice on PipeCode
| Skill lane | Practice path |
|---|---|
| Curated Atlassian practice set | /explore/practice/company/atlassian |
| SQL window functions | /explore/practice/topic/window-functions |
| Gaps and islands | /explore/practice/topic/gaps-and-islands |
| Moving average | /explore/practice/topic/moving-average |
| Time series | /explore/practice/topic/time-series |
| Aggregation | /explore/practice/topic/aggregation |
| Date functions | /explore/practice/topic/date-functions |
| Stack | /explore/practice/topic/stack |
| Binary search | /explore/practice/topic/binary-search |
| All practice topics | /explore/practice/topics |
| Interview courses | /explore/courses |
Communication under time pressure
State assumptions before typing: "I'll assume timestamps are sorted within each partition"; "I'll assume the date range fits in memory for the calendar table"; "I'll assume strict (not loose) gaps in the gaps-and-islands query." State invariants after key code blocks. State complexity: "this is O(N log N) for the sort + O(N) for the streaming aggregate." Interviewers grade clear reasoning above silent-and-perfect.
Frequently Asked Questions
What is the Atlassian data engineering interview process like?
The Atlassian data engineering interview typically includes a phone screen (mostly SQL warm-up around aggregation or window functions), one or two coding rounds focused on SQL window functions and time-series patterns, plus a Python round on stack or binary-search patterns. There's also a system-design conversation around pipelines and SaaS analytics architectures, and behavioral interviews. The curated 8-problem Atlassian practice set on PipeCode mirrors what you will see on the technical rounds.
What SQL topics does Atlassian test for data engineers?
Atlassian emphasizes window functions above all else—ranking with ROW_NUMBER/RANK, gaps-and-islands with the ROW_NUMBER difference trick, frame-based moving averages with ROWS BETWEEN, and continuous time-series with generate_series + LEFT JOIN. Plus aggregation for duplicate detection and date-interval arithmetic for subscription expiry. The window functions, gaps-and-islands, and moving-average topic pages cover the bulk.
How important are window functions in the Atlassian data engineering interview?
Window functions are the make-or-break skill: 5 of the 8 curated problems use OVER (...) clauses, including both Hard problems. Memorize the four flavors covered in §2, §4, §5, and §6 of this guide. A candidate who can't write RANK() OVER (PARTITION BY ... ORDER BY ...) from memory will struggle.
How important is Python for an Atlassian data engineering interview?
Python is a smaller part of the loop—just two patterns in the curated set: stack-based bracket validation and bisect-based first/last position search. Memorize the canonical implementations in §7 and §8. Stdlib fluency (bisect, collections) matters; deep algorithm knowledge less so for this loop.
What's the gaps-and-islands pattern Atlassian asks?
Gaps-and-islands finds runs of consecutive rows where some condition holds. The trick: filter to qualifying rows, assign ROW_NUMBER() OVER (ORDER BY date), then group by date - rn—the difference is constant within a run and changes across runs. The full pattern with worked example is in §4. The gaps-and-islands topic page has problems matching this shape directly.
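The constant-difference arithmetic can be sanity-checked in Python (an illustrative sketch, not the SQL itself): for sorted qualifying day numbers, day minus 0-based position is constant within a consecutive run.

```python
from itertools import groupby

days = [1, 2, 3, 7, 8, 10]  # qualifying days, already sorted

# Group by (day - index): the key is constant within each consecutive run.
runs = [
    [day for _, day in group]
    for _, group in groupby(enumerate(days), key=lambda t: t[1] - t[0])
]
print(runs)  # [[1, 2, 3], [7, 8], [10]]
```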
How many Atlassian practice problems should I solve before the interview?
Aim for 30–50 problems spanning all eight topic clusters above—not 100 of the same kind. Solve every problem in the Atlassian-tagged practice set, then back-fill weak areas using the topic pages linked throughout this guide. Pay extra attention to the four window-function variants; that's where Atlassian distinguishes between candidates.
Start practicing Atlassian data engineering problems
Reading patterns is not the same as typing them under time pressure. PipeCode pairs company-tagged Atlassian problems with tests, AI feedback, and a coding environment so you can drill the exact SQL window-function and Python patterns Atlassian asks—without the noise of generic algorithm prep that doesn't apply to this loop.
Pipecode.ai is Leetcode for Data Engineering.