Atlassian data engineering interview questions are the most SQL-window-function-heavy loop covered so far. Six of the curated eight problems are SQL—and four of those six lean directly on OVER (...) clauses for ranking, gaps-and-islands runs, frame-based moving averages, and continuous time-series. Expect aggregation for duplicate detection, daily ranking with ROW_NUMBER / RANK, subscription expiry math with date intervals, gaps-and-islands for consecutive runs, 3-day moving averages with ROWS BETWEEN, and continuous-day visitor counts via calendar tables. The two Python problems are classic complements: balanced parentheses with a stack, and first/last position via binary search.
This guide walks through the eight topic clusters Atlassian actually tests, each with a detailed topic explanation, per-sub-topic explanations with worked examples and solutions, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 8-problem Atlassian set (2 easy, 4 medium, 2 hard); both Hard problems are SQL window-function time-series, signaling that window-function depth is the make-or-break skill.
Top Atlassian data engineering interview topics
From the Atlassian data engineering practice set, the eight numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–8) | Why it shows up at Atlassian |
|---|---|---|
| 1 | SQL aggregation for duplicate detection | Duplicate Employee ID Detection—GROUP BY ... HAVING COUNT(*) > 1. |
| 2 | SQL window functions for daily ranking | Daily Top Visitor by Page Views—ROW_NUMBER() / RANK() partitioned by day. |
| 3 | SQL date functions for subscription expiry calculation | Subscription Expiry Date Calculation—start_date + INTERVAL with plan-aware durations. |
| 4 | SQL gaps-and-islands for consecutive run detection | Stadium Consecutive High-Attendance Days—ROW_NUMBER difference trick. |
| 5 | SQL window moving averages with frame clauses | 3-Day Moving Average of Pages Viewed—ROWS BETWEEN 2 PRECEDING AND CURRENT ROW. |
| 6 | SQL continuous time-series with calendar tables | Continuous Daily Visitors Count—generate_series + LEFT JOIN for fill-zero. |
| 7 | Python stacks for balanced parentheses | Balanced Parentheses—push openers, pop on closers, validate match. |
| 8 | Python binary search for first/last position | First and Last Position of Element in Sorted Array—bisect_left and bisect_right - 1. |
SaaS analytics framing rule: Atlassian's prompts are dressed in B2B SaaS data—page views, subscriptions, daily visitors, attendance counts. The interviewer is grading whether you map the framing to the right SQL primitive: ranking → window with PARTITION BY day; consecutive runs → gaps-and-islands; moving averages → ROWS BETWEEN; continuous days → calendar-table zero-fill. State the mapping out loud before writing code.
1. SQL Aggregation for Duplicate Detection
SQL aggregation for duplicate detection in data engineering
Duplicate detection is the simplest aggregation question and the most mishandled. The clean answer is one GROUP BY: group by the candidate key, count, filter to groups with count > 1—total cost O(N log N) for the sort, O(N) for the sweep. The mental model: aggregation collapses rows by key; HAVING filters the collapsed groups.
Pro tip: Notice whether the prompt wants the duplicate keys (the HAVING query directly) or all duplicate rows (a join back or IN (SELECT ...)). State the distinction; interviewers grade whether you noticed.
GROUP BY + HAVING COUNT(*) > 1
The canonical pattern: group by the candidate key, count, filter with HAVING. HAVING runs after GROUP BY and aggregates—filtering aggregates in WHERE is invalid because the aggregate isn't computed yet.
- GROUP BY key — collapses to one row per unique key.
- HAVING COUNT(*) > 1 — keeps only keys that appear more than once.
- Evaluation order: FROM → WHERE → GROUP BY → aggregates → HAVING → SELECT → ORDER BY.
- Worked example: employee_id values [100, 101, 100, 102, 101] → GROUP BY returns counts {100: 2, 101: 2, 102: 1}; HAVING COUNT(*) > 1 keeps {100, 101}.
SELECT employee_id, COUNT(*) AS dup_count
FROM employees
GROUP BY employee_id
HAVING COUNT(*) > 1
ORDER BY employee_id;
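The query above can be exercised end to end with Python's sqlite3 module. A minimal sketch; the in-memory table and sample ids mirror the worked example with employee_ids [100, 101, 100, 102, 101]:

```python
import sqlite3

# Build the worked example in memory and run the GROUP BY + HAVING query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees(employee_id INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?)",
                 [(100,), (101,), (100,), (102,), (101,)])
rows = conn.execute("""
    SELECT employee_id, COUNT(*) AS dup_count
    FROM employees
    GROUP BY employee_id
    HAVING COUNT(*) > 1
    ORDER BY employee_id
""").fetchall()
print(rows)  # [(100, 2), (101, 2)] -- 102 appears once and is filtered out
```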
COUNT(*) vs COUNT(col) vs COUNT(DISTINCT col)
Three counting variants, three semantics. Pick by what the prompt asks for.
- COUNT(*) — counts rows in the group, regardless of NULLs.
- COUNT(col) — counts rows where col is not NULL.
- COUNT(DISTINCT col) — counts unique non-null values; more expensive because the planner must dedupe first.
- Worked example: salaries [100, NULL, 200, 100] in one group → COUNT(*) = 4, COUNT(salary) = 3, COUNT(DISTINCT salary) = 2.
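The three counting variants can be verified in one line of SQL via sqlite3 (a sketch on the worked-example salaries, including the NULL):

```python
import sqlite3

# One group, four salary rows (one NULL): compare the three COUNT variants.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(salary INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [(100,), (None,), (200,), (100,)])
row = conn.execute(
    "SELECT COUNT(*), COUNT(salary), COUNT(DISTINCT salary) FROM t"
).fetchone()
print(row)  # (4, 3, 2) -- all rows, non-NULL rows, unique non-NULL values
```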
Returning all duplicate rows via IN (SELECT ...)
The HAVING query returns duplicate keys. To return the original rows (so you can see what's different between them), filter with IN (SELECT ...) or use a window-function COUNT() OVER.
- IN (SELECT ...) — two passes: the inner query produces duplicate keys, the outer scans and filters. Cost O(N log N) + O(N).
- COUNT(*) OVER (PARTITION BY key) — one pass; usually faster on large tables.
SELECT employee_id, name
FROM (
SELECT *, COUNT(*) OVER (PARTITION BY employee_id) AS cnt
FROM employees
) t
WHERE cnt > 1;
Common beginner mistakes
- Using COUNT(col) when COUNT(*) is correct—silently miscounts rows where col is NULL.
- Filtering COUNT(*) > 1 in WHERE instead of HAVING—WHERE runs before aggregation; the query is invalid.
- Forgetting that IN (NULL, ...) produces UNKNOWN and drops everything—use NOT EXISTS for negated set checks.
- Using a self-join (employees a JOIN employees b ON a.employee_id = b.employee_id AND a.id != b.id)—works but is O(N²) in the worst case.
- Returning duplicate keys when the prompt wanted full duplicate rows.
SQL interview question on duplicate detection
Table employees(employee_id, name, dept_id). Return all employee_ids that appear more than once, alongside the count of duplicates. Sort by dup_count descending, then employee_id ascending.
Solution using GROUP BY + HAVING
SELECT employee_id, COUNT(*) AS dup_count
FROM employees
GROUP BY employee_id
HAVING COUNT(*) > 1
ORDER BY dup_count DESC, employee_id ASC;
Why this works: GROUP BY employee_id collapses the table to one row per unique id; COUNT(*) is the per-id row count; HAVING COUNT(*) > 1 filters to ids with at least one duplicate. The ORDER BY clause uses an aggregate-derived alias, valid in standard SQL because the alias is in scope after SELECT. Total cost: one scan of employees, hash or sort-based aggregation, filter, sort. O(N log N) worst case, O(N) with a covering index on employee_id.
2. SQL Window Functions for Daily Ranking
SQL window-function ranking in data engineering
Daily-top ranking is the textbook SaaS window-function pattern: for each day, return the visitor with the highest page-view count. The pipeline: PARTITION BY day resets the ranking at each day boundary; ORDER BY metric DESC sorts within each day; keeping the rows where ROW_NUMBER() = 1 selects the top per day.
Pro tip: Windows can't be filtered in WHERE—the evaluation order is FROM → WHERE → GROUP BY → HAVING → window → SELECT. Wrap the window in a CTE, then filter the CTE outside. This is the most common window-function bug.
ROW_NUMBER / RANK / DENSE_RANK semantics
Three ranking functions because three different "rank" semantics show up in real prompts. Pick by the prompt's wording and state the choice out loud.
- ROW_NUMBER() — unique sequential integer; ties broken arbitrarily. Add a tie-break column (e.g. ORDER BY page_views DESC, visitor_id ASC) for determinism.
- RANK() — Olympic medal logic. Tied rows share a rank; the next rank skips (1, 1, 3).
- DENSE_RANK() — RANK without the gap (1, 1, 2). Use when the rank itself is a label or partition key.
- Worked example: page views [200, 150, 150, 100] ranked descending:
| page_views | row_number | rank | dense_rank |
|---|---|---|---|
| 200 | 1 | 1 | 1 |
| 150 | 2 | 2 | 2 |
| 150 | 3 | 2 | 2 |
| 100 | 4 | 4 | 3 |
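The table above can be reproduced with sqlite3, which supports all three ranking functions (a sketch; assumes a Python build with SQLite 3.25+ for window-function support):

```python
import sqlite3

# Rank [200, 150, 150, 100] descending with all three ranking functions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE v(page_views INTEGER)")
conn.executemany("INSERT INTO v VALUES (?)", [(200,), (150,), (150,), (100,)])
rows = conn.execute("""
    SELECT page_views,
           ROW_NUMBER() OVER (ORDER BY page_views DESC) AS rn,
           RANK()       OVER (ORDER BY page_views DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY page_views DESC) AS drnk
    FROM v
    ORDER BY page_views DESC
""").fetchall()
# Ties at 150 share rank 2; RANK skips to 4, DENSE_RANK continues at 3.
print(rows)
```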
PARTITION BY day ORDER BY metric DESC
PARTITION BY buckets rows into independent groups; the window resets at each partition. ORDER BY page_views DESC sorts within each partition; ROW_NUMBER() = 1 is the per-day winner. The pattern is composable—change the partition key (week, month, region) without changing the shape.
- Tie-break: add a secondary key (ORDER BY page_views DESC, visitor_id ASC) for deterministic results.
- Worked example: two visitors over two days—V1 wins 04-28 (100 views), V2 wins 04-29 (80 views).
SELECT
day, visitor_id, page_views,
ROW_NUMBER() OVER (PARTITION BY day ORDER BY page_views DESC, visitor_id ASC) AS rn
FROM visits;
Filtering on a window column via CTE
Standard SQL evaluates WHERE before windows, so WHERE rn = 1 directly on the windowed column is invalid. The fix is structural: wrap the window in a CTE, then filter outside where rn is a real column.
- CTE wrapper — portable across PostgreSQL, MySQL 8+, and standard SQL.
- QUALIFY ROW_NUMBER() OVER (...) = 1 — Snowflake / BigQuery / Teradata only; cleaner but not portable.
- Aggregate-and-self-join — always works; always slower.
- Worked example: the CTE-wrapped query on the visits data above returns one row per day: (04-28, V1, 100) and (04-29, V2, 80).
WITH ranked AS (
SELECT
day, visitor_id, page_views,
ROW_NUMBER() OVER (PARTITION BY day ORDER BY page_views DESC, visitor_id ASC) AS rn
FROM visits
)
SELECT day, visitor_id, page_views FROM ranked WHERE rn = 1;
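The CTE-wrapped query runs as written on sqlite3 (a sketch; the sample rows are illustrative and match the two-visitor worked example, assuming SQLite 3.25+):

```python
import sqlite3

# Two visitors over two days: V1 wins 04-28, V2 wins 04-29.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visits(visitor_id TEXT, day TEXT, page_views INTEGER);
INSERT INTO visits VALUES
 ('V1','2026-04-28',100),('V2','2026-04-28',80),
 ('V1','2026-04-29',60),('V2','2026-04-29',80);
""")
rows = conn.execute("""
WITH ranked AS (
  SELECT day, visitor_id, page_views,
         ROW_NUMBER() OVER (
           PARTITION BY day
           ORDER BY page_views DESC, visitor_id ASC
         ) AS rn
  FROM visits
)
SELECT day, visitor_id, page_views FROM ranked WHERE rn = 1 ORDER BY day
""").fetchall()
print(rows)  # [('2026-04-28', 'V1', 100), ('2026-04-29', 'V2', 80)]
```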
Common beginner mistakes
- Writing WHERE rn = 1 directly on a window column—evaluation order makes it invalid.
- Using RANK when the prompt requires DENSE_RANK or ROW_NUMBER—mismatched semantics.
- Forgetting PARTITION BY and ranking globally instead of per-day.
- Forgetting the tie-break secondary key—non-deterministic results.
- Filtering window output without a CTE wrapper.
SQL interview question on daily ranking
Table visits(visitor_id, day, page_views). For each day, return the top-1 visitor by page views (ties broken by visitor_id ascending). Use one CTE.
Solution using ROW_NUMBER partitioned by day
WITH ranked AS (
SELECT
day, visitor_id, page_views,
ROW_NUMBER() OVER (
PARTITION BY day
ORDER BY page_views DESC, visitor_id ASC
) AS rn
FROM visits
)
SELECT day, visitor_id, page_views
FROM ranked
WHERE rn = 1
ORDER BY day;
Why this works: PARTITION BY day resets the ranking at each day boundary, so the window operates per-day. ORDER BY page_views DESC, visitor_id ASC ranks visitors top-first with deterministic tie-break. ROW_NUMBER() = 1 keeps exactly one row per day (vs RANK = 1 which would keep all tied winners). The CTE wrapper is required because windows can't be filtered in WHERE. The outer ORDER BY day produces deterministic output for downstream consumers.
3. SQL Date Functions for Subscription Expiry Calculation
SQL date arithmetic in data engineering
Subscription accounting is bread-and-butter B2B SaaS analytics. Given a start_date and a plan with a known duration, compute expiry_date. The clean answer uses interval arithmetic: start_date + INTERVAL '1 month'. Dates support arithmetic just like numbers—but with calendar-aware semantics: clamping for invalid dates, and dialect-specific behavior worth naming aloud.
Pro tip: Real subscription systems have monthly, annual, biennial, lifetime plans. Switch the interval with a CASE—or, better, store the duration as a column on the plan and join it in.
start_date + INTERVAL
PostgreSQL's interval arithmetic on dates and timestamps reads mathematically: take the start date, add the interval. The result type matches the input.
- Common units — '1 day', '1 week', '1 month', '3 months', '1 year', '1 year 6 months'.
- Month clamping — 2026-01-31 + INTERVAL '1 month' = 2026-02-28 (February has 28 days; the result is clamped to the last valid day).
- Leap-day clamping — 2024-02-29 + INTERVAL '1 year' = 2025-02-28.
- Worked example: start_date = 2026-01-15 + INTERVAL '1 month' → 2026-02-15; start_date = 2025-04-28 + INTERVAL '1 year' → 2026-04-28.
SELECT
subscription_id,
start_date,
start_date + INTERVAL '1 month' AS monthly_expiry,
start_date + INTERVAL '1 year' AS annual_expiry
FROM subscriptions;
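PostgreSQL does the month clamping natively. To make the semantics concrete, here is a pure-Python sketch of the same rule; add_months_clamped is a hypothetical helper name, not a library function:

```python
import calendar
from datetime import date

def add_months_clamped(d: date, months: int) -> date:
    # Mimics PostgreSQL's date + INTERVAL 'n months': the day-of-month
    # is clamped to the last valid day of the target month.
    y, m = divmod(d.year * 12 + (d.month - 1) + months, 12)
    last_day = calendar.monthrange(y, m + 1)[1]
    return date(y, m + 1, min(d.day, last_day))

print(add_months_clamped(date(2026, 1, 31), 1))    # 2026-02-28 (clamped)
print(add_months_clamped(date(2026, 1, 15), 1))    # 2026-02-15
print(add_months_clamped(date(2024, 2, 29), 12))   # 2025-02-28 (leap clamp)
```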
EXTRACT / DATE_PART for components
EXTRACT(field FROM source) and the equivalent DATE_PART(field, source) return a numeric component of a date or timestamp.
- Common fields — year, month, day, dow (day of week, 0 = Sunday), doy, quarter, hour, epoch.
- Use cases — EXTRACT(year FROM start_date) for per-year aggregation; EXTRACT(dow FROM ts) for weekend filters; EXTRACT(epoch FROM (end_ts - start_ts)) for duration in seconds.
- Worked example: for 2026-04-28 14:30:00, EXTRACT(year) = 2026, EXTRACT(month) = 4, EXTRACT(dow) = 2 (Tuesday), EXTRACT(hour) = 14.
Plan-aware CASE for variable-duration plans
Real subscription systems have multiple plan durations. The cleanest in-line SQL is a CASE expression that returns the appropriate interval per row.
- ELSE NULL — defends against unknown plan values; the row gets a NULL expiry instead of crashing.
- Production-grade alternative — store durations on a plans dimension and join: adding a new plan becomes a row, not a code change.
- Worked example: start_date = 2026-04-28, plan 'annual' → expiry_date = 2027-04-28.
SELECT
subscription_id,
start_date,
start_date + (
CASE plan
WHEN 'monthly' THEN INTERVAL '1 month'
WHEN 'annual' THEN INTERVAL '1 year'
WHEN 'biennial' THEN INTERVAL '2 years'
ELSE NULL
END
) AS expiry_date
FROM subscriptions;
Common beginner mistakes
- Hard-coding start_date + 30 (integer days) and missing month/year clamping.
- Using BETWEEN for date-range filters and over-including the upper boundary on timestamp columns.
- Forgetting that EXTRACT(dow ...) is 0-indexed (Sunday = 0)—off-by-one in weekend filters.
- Building expiry math with EXTRACT + CONCAT instead of just adding an INTERVAL.
- Ignoring leap-day clamping: 2024-02-29 + INTERVAL '1 year' yields 2025-02-28, not the (invalid) 2025-02-29.
SQL interview question on subscription expiry
Table subscriptions(subscription_id, start_date, plan) with plan ∈ {'monthly', 'annual', 'biennial'}. Return each subscription with a computed expiry_date. Unknown plans should yield NULL expiry.
Solution using INTERVAL arithmetic and CASE
SELECT
subscription_id,
start_date,
start_date + (
CASE plan
WHEN 'monthly' THEN INTERVAL '1 month'
WHEN 'annual' THEN INTERVAL '1 year'
WHEN 'biennial' THEN INTERVAL '2 years'
ELSE NULL
END
)::date AS expiry_date
FROM subscriptions
ORDER BY subscription_id;
Why this works: The CASE returns the appropriate INTERVAL per row based on plan; adding it to start_date produces the expiry. The ::date cast keeps the result a date (otherwise interval arithmetic on a date returns a timestamp). The ELSE NULL branch catches unknown plans and produces NULL—the query never crashes on bad data, and downstream consumers can filter or alert on NULL expiries. Per-row cost is constant; total is O(N) for the table scan.
4. SQL Gaps-and-Islands for Consecutive Run Detection
Gaps-and-islands in SQL for data engineering
"Find consecutive runs of rows where some condition holds" is the gaps-and-islands family. The trick: filter to qualifying rows, assign ROW_NUMBER() OVER (ORDER BY date), then date - row_number is constant within each run and different across runs. Group by that constant and you have your islands.
The ROW_NUMBER difference trick
For consecutive dates, each step in date and each step in rn increments by 1, so their difference is invariant. A gap (a non-qualifying day) breaks the invariant—the next qualifying date jumps ahead while rn only advances by 1, so date - rn changes value.
- Filter first — WHERE attendance >= threshold before the ROW_NUMBER; the trick only works on qualifying rows.
- Use ROW_NUMBER, not RANK — duplicates in the order key would tie ranks and break the difference.
- Worked example: 5 high-attendance days [04-01, 04-02, 04-03, 04-05, 04-06]:
| date | row_number | date - rn (days) | island |
|---|---|---|---|
| 04-01 | 1 | 03-31 | A |
| 04-02 | 2 | 03-31 | A |
| 04-03 | 3 | 03-31 | A |
| 04-05 | 4 | 04-01 | B |
| 04-06 | 5 | 04-01 | B |
Identifying islands by group key (date - rn)
Once you have the group_key column, group by it: each group is one island. Per-island aggregates produce the run summary.
- MIN(date) — run start.
- MAX(date) — run end.
- COUNT(*) — run length.
- AVG(metric) / SUM(metric) — cumulative metrics across the run.
- Note: group_key is meaningless after the fact (it's a date offset); discard it.
WITH labeled AS (
SELECT date, attendance,
date - INTERVAL '1 day' * ROW_NUMBER() OVER (ORDER BY date) AS group_key
FROM stadium_attendance
WHERE attendance >= 50000
)
SELECT
MIN(date) AS run_start,
MAX(date) AS run_end,
COUNT(*) AS run_length,
AVG(attendance) AS avg_attendance
FROM labeled
GROUP BY group_key;
Filtering islands of length ≥ N
"Find runs of at least 3 consecutive days" is a HAVING COUNT(*) >= N filter on the GROUP BY query. Same evaluation-order rule as §1: WHERE runs before aggregation, so length filters belong in HAVING.
- Worked example: with threshold 3 on the table above, only the 04-01..04-03 run (length 3) passes; the 04-05..04-06 run (length 2) drops out.
WITH labeled AS (
SELECT date, attendance,
date - INTERVAL '1 day' * ROW_NUMBER() OVER (ORDER BY date) AS group_key
FROM stadium_attendance
WHERE attendance >= 50000
)
SELECT MIN(date) AS run_start, MAX(date) AS run_end, COUNT(*) AS run_length
FROM labeled
GROUP BY group_key
HAVING COUNT(*) >= 3
ORDER BY run_start;
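The length-filter query can be checked with sqlite3. SQLite has no INTERVAL type, so this sketch swaps in julianday(day) - rn as the group key—the same invariant, expressed numerically (sample attendance figures are illustrative; assumes SQLite 3.25+):

```python
import sqlite3

# Two high-attendance runs: 04-01..04-03 (length 3) and 04-05..04-06 (length 2).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stadium_attendance(day TEXT, attendance INTEGER);
INSERT INTO stadium_attendance VALUES
 ('2026-04-01',60000),('2026-04-02',55000),('2026-04-03',52000),
 ('2026-04-04',10000),('2026-04-05',70000),('2026-04-06',65000);
""")
rows = conn.execute("""
WITH labeled AS (
  SELECT day,
         julianday(day) - ROW_NUMBER() OVER (ORDER BY day) AS group_key
  FROM stadium_attendance
  WHERE attendance >= 50000
)
SELECT MIN(day) AS run_start, MAX(day) AS run_end, COUNT(*) AS run_length
FROM labeled
GROUP BY group_key
HAVING COUNT(*) >= 3
ORDER BY run_start
""").fetchall()
print(rows)  # [('2026-04-01', '2026-04-03', 3)] -- the length-2 run drops out
```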
Common beginner mistakes
- Forgetting to filter with WHERE attendance >= threshold before the ROW_NUMBER—the trick only works on the qualifying rows.
- Using RANK() instead of ROW_NUMBER()—if the order key has duplicates, ranks tie and the difference trick breaks.
- Forgetting to multiply by INTERVAL '1 day' on timestamp columns—timestamp - integer is invalid in PostgreSQL (plain date - integer does work, returning a date).
- Filtering islands of length ≥ N with WHERE COUNT(*) >= N—COUNT(*) isn't computed yet in WHERE.
- Trying to recover the group_key after the fact—it's an opaque internal value, not a meaningful date.
SQL interview question on gaps-and-islands
Table stadium_attendance(date, attendance). Return each maximal run of consecutive days where attendance >= 50000, with columns run_start, run_end, run_length. Sort by run_start.
Solution using the ROW_NUMBER difference trick
WITH high_days AS (
SELECT date, attendance
FROM stadium_attendance
WHERE attendance >= 50000
),
labeled AS (
SELECT
date, attendance,
date - INTERVAL '1 day' * ROW_NUMBER() OVER (ORDER BY date) AS group_key
FROM high_days
)
SELECT
MIN(date) AS run_start,
MAX(date) AS run_end,
COUNT(*) AS run_length
FROM labeled
GROUP BY group_key
ORDER BY run_start;
Why this works: The first CTE filters to qualifying days. The second CTE assigns a row_number() to each qualifying day in date order, then computes date - rn as a group key. Because consecutive days increment date and rn by exactly 1 each, their difference is constant within a run and changes across runs. GROUP BY group_key collapses each run; MIN(date), MAX(date), COUNT(*) produce the run's start, end, and length. The total cost is O(N log N) for the row-number sort + O(N) for the group-by.
5. SQL Window Moving Averages with Frame Clauses
Moving averages and frame clauses in SQL for data engineering
A moving average smooths noisy time-series: the 3-day average of daily page views is less spiky than raw values. The clean answer uses frame clauses in the window: AVG(page_views) OVER (ORDER BY date ROWS BETWEEN 2 PRECEDING AND CURRENT ROW). The frame defines which rows the aggregate sees per row.
Pro tip: Without an explicit frame, an AVG() OVER (ORDER BY ...) defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW—a running average from the start, not a moving window. Always specify the frame explicitly when correctness matters.
ROWS BETWEEN N PRECEDING AND CURRENT ROW
The frame clause has the form ROWS BETWEEN <start> AND <end>, where each endpoint is UNBOUNDED PRECEDING, N PRECEDING, CURRENT ROW, N FOLLOWING, or UNBOUNDED FOLLOWING.
- Running total — ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Aggregate from start to here.
- 3-row trailing window — ROWS BETWEEN 2 PRECEDING AND CURRENT ROW. Current row plus the two before.
- Centered window — ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING. Symmetric around the current row.
- Boundary behavior — at the start of the partition, the frame shrinks: the first day's "3-day moving average" is just the first day's value.
- Worked example: daily page views and 3-row trailing average:
| day | page_views | 3-day moving avg |
|---|---|---|
| 04-01 | 100 | 100.00 (1 row) |
| 04-02 | 120 | 110.00 (avg of 100, 120) |
| 04-03 | 80 | 100.00 (avg of 100, 120, 80) |
| 04-04 | 150 | 116.67 (avg of 120, 80, 150) |
| 04-05 | 200 | 143.33 (avg of 80, 150, 200) |
SELECT
day, page_views,
AVG(page_views) OVER (
ORDER BY day
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_3d
FROM page_views_daily;
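The worked table above can be reproduced with sqlite3, which supports ROWS BETWEEN frames (a sketch; assumes SQLite 3.25+):

```python
import sqlite3

# 3-day trailing moving average over the worked-example series.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views_daily(day TEXT, page_views INTEGER)")
conn.executemany(
    "INSERT INTO page_views_daily VALUES (?, ?)",
    [("2026-04-01", 100), ("2026-04-02", 120), ("2026-04-03", 80),
     ("2026-04-04", 150), ("2026-04-05", 200)],
)
rows = conn.execute("""
    SELECT day,
           ROUND(AVG(page_views) OVER (
               ORDER BY day
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ), 2) AS moving_avg_3d
    FROM page_views_daily
    ORDER BY day
""").fetchall()
# The frame shrinks at the start: 100.0, 110.0, 100.0, 116.67, 143.33.
print([r[1] for r in rows])
```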
ROWS vs RANGE for time-series with duplicates
Two frame modes exist; they're identical when the order key has unique values and diverge when there are duplicates.
- ROWS BETWEEN ... AND ... — counts physical rows. 2 PRECEDING = exactly 2 rows back, regardless of value.
- RANGE BETWEEN ... AND ... — counts based on the order-key value. RANGE BETWEEN INTERVAL '2 days' PRECEDING = all rows within 2 days of the current key.
- Default when ORDER BY is present and no frame is given is RANGE—surprises candidates who expected ROWS.
- Worked example: with duplicates at 09:00:00 (values 10 and 20) and one row at 09:00:01 (value 5), ROWS BETWEEN 2 PRECEDING always sees ≤ 3 physical rows; RANGE BETWEEN INTERVAL '1 second' PRECEDING sees all rows within the last second (could be many more).
Centered moving averages with 1 PRECEDING AND 1 FOLLOWING
Trailing windows lag the underlying signal because older data dominates. Centered windows (1 PRECEDING AND 1 FOLLOWING) are symmetric around the current row—better for historical smoothing. Trade-off: they need future data, so they aren't usable for real-time forecasting.
- Worked example: on the same daily series above, the centered 3-day average for 04-03 is (120 + 80 + 150) / 3 = 116.67; the trailing version was 100.00. Frames shrink at both ends.
SELECT
day, page_views,
AVG(page_views) OVER (
ORDER BY day
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
) AS centered_avg
FROM page_views_daily;
Common beginner mistakes
- Forgetting to specify the frame and getting the implicit RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (running total) instead of a moving window.
- Using ROWS when RANGE is correct (or vice versa) on time-series with duplicates.
- Forgetting that the frame shrinks at partition boundaries—first and last rows have shorter effective windows.
- Trying to filter by moving_avg_3d > X in WHERE—same window-function-in-WHERE bug as §2; wrap in a CTE.
- Forgetting ORDER BY inside OVER ()—the window function then uses arbitrary order, producing nonsense.
SQL interview question on moving averages
Table daily_pageviews(day, page_views). Return each day with its 3-day trailing moving average of page_views. Sort by day.
Solution using a window function with ROWS BETWEEN
SELECT
day,
page_views,
AVG(page_views) OVER (
ORDER BY day
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_3d
FROM daily_pageviews
ORDER BY day;
Why this works: The window's ORDER BY day defines the row sequence; ROWS BETWEEN 2 PRECEDING AND CURRENT ROW defines a 3-row trailing window per row. AVG(page_views) over that window is the moving average. At the start of the series, the window shrinks (first day has 1 row, second day has 2)—this is the expected behavior for trailing windows. Total cost: O(N log N) for the sort + O(N) for the streaming aggregate.
6. SQL Continuous Time-Series with Calendar Tables
Continuous time-series with calendar tables in SQL for data engineering
Real time-series have gaps—days where no events happened. The fix is a calendar table (or generate_series in PostgreSQL): a synthetic source of consecutive dates, left-joined to the fact table. Days with no fact rows get NULL on the right side; COALESCE fills zero. Without it, charts and rolling-window calculations skip the gap days and produce wrong totals.
Generating a date series with generate_series
PostgreSQL's generate_series(start, end, step) produces a synthetic table of values—for dates, all calendar days in the range as a single column.
- Static range — generate_series('2026-04-01'::date, '2026-04-30'::date, INTERVAL '1 day').
- Dynamic range — parameterize with (SELECT MIN(day) FROM events) so the series adapts to the data.
- Cast — append ::date if you want date instead of timestamp.
- Worked example: generates 5 consecutive days starting 2026-04-28: [04-28, 04-29, 04-30, 05-01, 05-02].
SELECT generate_series(
'2026-04-28'::date,
'2026-05-02'::date,
INTERVAL '1 day'
)::date AS day;
LEFT JOIN calendar to fact table for fill-zero
Combine generate_series with LEFT JOIN to the fact table to fill missing days with zero. This is the SQL equivalent of pandas's df.reindex(date_range, fill_value=0).
- LEFT JOIN — keeps every calendar row regardless of whether the fact has it.
- COALESCE(metric, 0) — converts the NULL on missing days to zero.
- Worked example: facts only have rows for 04-28 (10 visitors) and 04-30 (5 visitors). After the calendar join: 04-28: 10, 04-29: 0 (filled), 04-30: 5, 05-01: 0 (filled), 05-02: 0 (filled).
WITH calendar AS (
SELECT generate_series('2026-04-01'::date, '2026-04-30'::date, INTERVAL '1 day')::date AS day
),
visits AS (
SELECT day, COUNT(DISTINCT visitor_id) AS visitor_count
FROM page_visits
WHERE day BETWEEN '2026-04-01' AND '2026-04-30'
GROUP BY day
)
SELECT
c.day,
COALESCE(v.visitor_count, 0) AS visitors
FROM calendar c
LEFT JOIN visits v ON v.day = c.day
ORDER BY c.day;
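The fill-zero pattern can be checked with sqlite3. Python's bundled SQLite doesn't ship generate_series, so this sketch swaps in a recursive CTE as the calendar source (sample visitor rows are illustrative, not the worked example's counts):

```python
import sqlite3

# Facts exist only for 04-28 (2 distinct visitors) and 04-30 (1 visitor);
# the calendar LEFT JOIN fills the other days with zero.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page_visits(day TEXT, visitor_id TEXT);
INSERT INTO page_visits VALUES
 ('2026-04-28','a'),('2026-04-28','b'),('2026-04-30','c');
""")
rows = conn.execute("""
WITH RECURSIVE calendar(day) AS (
  SELECT '2026-04-28'
  UNION ALL
  SELECT date(day, '+1 day') FROM calendar WHERE day < '2026-05-02'
),
visits AS (
  SELECT day, COUNT(DISTINCT visitor_id) AS visitor_count
  FROM page_visits
  GROUP BY day
)
SELECT c.day, COALESCE(v.visitor_count, 0) AS visitors
FROM calendar c
LEFT JOIN visits v ON v.day = c.day
ORDER BY c.day
""").fetchall()
print(rows)  # 04-29, 05-01, 05-02 appear with 0 instead of being missing
```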
Continuous visitor count: LEAD/LAG to detect breaks
Once the calendar join produces a gap-free series, LAG(visitors) gives yesterday's count and LEAD(visitors) gives tomorrow's. Combine with the §4 gaps-and-islands trick to find the longest run of consecutive days with at least one visitor.
- Composition — calendar table fills gaps; gaps-and-islands then identifies runs in the gap-free series.
- Worked example: with the calendar-filled data above, days where visitors > 0 are [04-28, 04-30]—non-consecutive (because 04-29 had zero). Longest run length = 1.
-- calendar and visits are the CTEs defined in the previous query
WITH dense AS (
SELECT c.day, COALESCE(v.visitor_count, 0) AS visitors
FROM calendar c LEFT JOIN visits v USING (day)
),
labeled AS (
SELECT day,
day - INTERVAL '1 day' * ROW_NUMBER() OVER (ORDER BY day) AS group_key
FROM dense
WHERE visitors > 0
)
SELECT MIN(day) AS run_start, MAX(day) AS run_end, COUNT(*) AS run_length
FROM labeled
GROUP BY group_key
ORDER BY run_length DESC
LIMIT 1;
Common beginner mistakes
- Using INNER JOIN instead of LEFT JOIN—drops the calendar rows that have no fact match (the whole point is to keep them).
- Forgetting COALESCE(metric, 0) and getting NULL in the output instead of zero.
- Hardcoding the calendar bounds and missing days outside that range.
- Generating the calendar inside a subquery without indexing—works but slow on large date ranges.
- Reaching for generate_series when a real dim_calendar table exists in the warehouse—use the existing table; it's pre-indexed and includes business attributes (holidays, fiscal weeks).
SQL interview question on continuous time-series
Table daily_visitors(day, visitor_count) has rows only for days with visitors. Return a gap-free time series for the date range [start, end] with zero-fill for missing days, plus a 7-day trailing sum of visitors.
Solution using generate_series + LEFT JOIN + window-function rolling sum
WITH calendar AS (
SELECT generate_series('2026-04-01'::date, '2026-04-30'::date, INTERVAL '1 day')::date AS day
),
filled AS (
SELECT
c.day,
COALESCE(v.visitor_count, 0) AS visitors
FROM calendar c
LEFT JOIN daily_visitors v ON v.day = c.day
)
SELECT
day,
visitors,
SUM(visitors) OVER (
ORDER BY day
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS rolling_7d_sum
FROM filled
ORDER BY day;
Why this works: generate_series produces a complete date sequence regardless of fact-table gaps. LEFT JOIN attaches actual visitor counts, with NULL on missing days; COALESCE(... , 0) fills NULLs with zero. The final SUM() OVER (... ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) is a 7-row trailing window aggregate, computed correctly on the gap-free series. If we'd skipped the calendar join, the 7-day rolling sum would have been computed over fewer rows (skipping zero-visitor days), producing wrong totals.
7. Python Stacks for Balanced Parentheses
Stacks for matching brackets in Python for data engineering
Balanced parentheses is the textbook stack interview—a candidate who can't write it cleanly in 10 lines will get filtered at the phone screen. Walk the string, push openers ((, [, {) onto a stack, pop on closers (), ], }) and verify the popped opener matches. Empty stack at end-of-string means balanced; a non-empty stack means unmatched openers; a mismatched pop means crossed brackets.
Pro tip: A Python list is the right data structure: append to push, pop to pop, both O(1) amortized. For 1M-character input the algorithm runs in 1M ops with at most depth-many slots in the stack.
Stack basics: list.append / list.pop for push/pop
Python's list works as a LIFO stack at the right end. The four idioms cover everything you need.
- stack.append(x) — push, O(1) amortized.
- top = stack.pop() — pop and use, O(1).
- if not stack: ... — empty check.
- top = stack[-1] — peek without popping.
- Avoid list.pop(0) — that's O(n); for stacks use the right end only.
- Worked example: push (, [, { then pop twice → the stack ends as ['('], with the pops returning '{' then '['.
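The four idioms, applied to the worked example in a few lines:

```python
# Push (, [, { onto a list-backed stack, then pop twice.
stack: list[str] = []
stack.append("(")        # push, O(1) amortized
stack.append("[")
stack.append("{")
top = stack[-1]          # peek without popping -> "{"
first = stack.pop()      # pop -> "{"
second = stack.pop()     # pop -> "["
print(stack)             # ['('] : one unmatched opener remains
```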
Push openers, pop on closers, validate match
The full algorithm is a single pass over the string with a pairs map from closer to opener.
- Pairs map — {")": "(", "]": "[", "}": "{"}.
- For each char: push if opener; if closer, pop and verify match (return False on empty stack or mismatch).
- End-of-string: non-empty stack → False; empty → True.
- Complexity — O(N) time, O(N) worst-case space (deeply nested input like (((...)))).
- Worked example: trace of s = "({[]})":
| char | action | stack after |
|---|---|---|
| ( | push | ['('] |
| { | push | ['(', '{'] |
| [ | push | ['(', '{', '['] |
| ] | pop, [ == [ ✓ | ['(', '{'] |
| } | pop, { == { ✓ | ['('] |
| ) | pop, ( == ( ✓ | [] |
End: stack empty → balanced.
def is_balanced(s: str) -> bool:
pairs = {")": "(", "]": "[", "}": "{"}
stack: list[str] = []
for ch in s:
if ch in "({[":
stack.append(ch)
elif ch in ")}]":
if not stack or stack.pop() != pairs[ch]:
return False
return not stack
Generalizing to multiple bracket types via dict
The pairs dict is the key to generalization—adding new bracket types is a one-line dict change, the algorithm doesn't.
- Add a pair — pairs["⟧"] = "⟦" extends to math notation; same loop.
- Tags / richer formats — combine a stack with a tokenizer; pure stack-based parsing handles only the simple bracket case.
- Cleaner shape — derive openers = set(pairs.values()) so the loop tests against the dict, not hardcoded strings.
def is_balanced(s: str) -> bool:
pairs = {")": "(", "]": "[", "}": "{"}
openers = set(pairs.values())
stack: list[str] = []
for ch in s:
if ch in openers:
stack.append(ch)
elif ch in pairs:
if not stack or stack.pop() != pairs[ch]:
return False
return not stack
Common beginner mistakes
- Forgetting to check empty stack before popping — `IndexError` on inputs like `)`.
- Using `list.pop(0)` (FIFO) instead of `list.pop()` (LIFO) — wrong stack semantics, also O(n).
- Hardcoding `if ch == "(" and stack[-1] == ")"` instead of using a pairs dict — doesn't generalize.
- Returning `True` mid-loop on a successful match — you must check end-of-string with empty stack.
- Ignoring non-bracket characters (e.g. text inside parens like `"(hello)"`) — need to skip them or handle explicitly.
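To see why the empty-stack guard in the first mistake matters, compare a deliberately buggy pop with the guarded one (a throwaway sketch, not the solution code):

```python
pairs = {")": "(", "]": "[", "}": "{"}

def buggy(s: str) -> bool:
    stack = []
    for ch in s:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            if stack.pop() != pairs[ch]:  # BUG: no empty-stack check
                return False
    return not stack

def guarded(s: str) -> bool:
    stack = []
    for ch in s:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack

try:
    buggy(")")                 # pop on an empty list raises
except IndexError:
    print("buggy: IndexError on a lone closer")
print(guarded(")"))            # False, no crash
```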
Python interview question on balanced parentheses
Implement is_balanced(s) that returns True if all brackets in s ((), [], {}) are properly matched and nested. Non-bracket characters in the string should be ignored. Use O(n) time and O(n) space.
Solution using a stack and a pairs dict
```python
def is_balanced(s: str) -> bool:
    pairs = {")": "(", "]": "[", "}": "{"}
    openers = set(pairs.values())
    stack: list[str] = []
    for ch in s:
        if ch in openers:
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack
```
Why this works: Each character is examined once (O(n) time). Openers push to the stack; closers pop and verify match against the pairs dict. The empty-stack check before pop catches "closer without opener" cases (e.g. ) standalone). The mismatch check catches crossed brackets (e.g. ([)]). The final return not stack catches unclosed openers (e.g. ((). Non-bracket characters fall through both if branches and are ignored. Memory is O(n) worst case for fully-nested input like (((...))).
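The failure modes named above can be checked directly (the function is repeated here so the snippet runs standalone):

```python
def is_balanced(s: str) -> bool:
    pairs = {")": "(", "]": "[", "}": "{"}
    stack: list[str] = []
    for ch in s:
        if ch in pairs.values():
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack

print(is_balanced(")"))        # closer without opener -> False
print(is_balanced("([)]"))     # crossed brackets -> False
print(is_balanced("(("))       # unclosed openers -> False
print(is_balanced("(hello)"))  # non-bracket chars ignored -> True
```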
8. Python Binary Search for First/Last Position
Binary search for first/last position in Python for data engineering
Given a sorted array with duplicates, return the first and last indices where a target appears. Linear scan is O(n); bisect is O(log n). On a 10M-element array that's roughly 10,000,000 comparisons versus about 24 (log2 of 10M ≈ 23.3), a speedup of over five orders of magnitude. The mental model: `bisect_left` returns the leftmost insertion point; `bisect_right` returns the rightmost. For a present target, `[bisect_left, bisect_right - 1]` is the inclusive index range.
Pro tip: When `bisect_left == bisect_right`, the target is absent—return `[-1, -1]`. This single comparison is cheaper and clearer than separate "is it present" checks.
bisect_left for first occurrence
bisect.bisect_left(arr, x) returns the leftmost insertion point: the first index i such that arr[i] >= x. For a present value, this is the first occurrence; for an absent value, the spot to insert.
- Lower-bound semantics — for duplicates of `x`, points at the leftmost.
- Presence check — `i < len(arr) and arr[i] == x`.
- Worked example: `arr = [1, 2, 2, 2, 3, 5]`, `x = 2`:
| function | result | meaning |
|---|---|---|
| `bisect_left(arr, 2)` | 1 | leftmost 2 is at index 1 |
| `bisect_right(arr, 2)` | 4 | one past the rightmost 2 |
| `bisect_right - 1` | 3 | rightmost 2 |
```python
from bisect import bisect_left

def first_occurrence(arr: list[int], x: int) -> int:
    i = bisect_left(arr, x)
    if i < len(arr) and arr[i] == x:
        return i
    return -1
```
bisect_right - 1 for last occurrence
bisect.bisect_right(arr, x) returns the rightmost insertion point: the first index where arr[i] > x. Subtract 1 for the last occurrence of x itself.
- Upper-bound semantics — for duplicates of `x`, points one past the rightmost.
- Guard against negative index — check `j > 0 and arr[j - 1] == x` before using `j - 1`; otherwise Python silently indexes from the end when `j == 0`.
- Worked example: same array, `x = 2`. `bisect_right(arr, 2) = 4`; last occurrence at index 3; `arr[3] == 2` ✓.
```python
from bisect import bisect_right

def last_occurrence(arr: list[int], x: int) -> int:
    j = bisect_right(arr, x)
    if j > 0 and arr[j - 1] == x:
        return j - 1
    return -1
```
Two-pointer alternative when bisect isn't available
When the interview restricts `bisect`, code the binary-search loop from scratch with the half-open range `lo, hi = 0, len(arr)` and `mid = (lo + hi) // 2`.
- First-occurrence comparison — `if arr[mid] < x: lo = mid + 1; else: hi = mid`.
- Last-occurrence comparison — `if arr[mid] <= x: lo = mid + 1; else: hi = mid`, then `last = lo - 1`.
- Loop condition — `while lo < hi` (not `<=`); termination at `lo == hi`.
- Worked example: `arr = [1, 2, 2, 2, 3, 5]`, `x = 2`, manual first-occurrence:
| lo | hi | mid | arr[mid] | action |
|---|---|---|---|---|
| 0 | 6 | 3 | 2 | not <, hi = 3 |
| 0 | 3 | 1 | 2 | not <, hi = 1 |
| 0 | 1 | 0 | 1 | <, lo = 1 |
| 1 | 1 | — | — | exit |
Result: `lo = 1`. `arr[1] == 2` ✓ → first occurrence is 1.
```python
def first_occurrence_manual(arr: list[int], x: int) -> int:
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(arr) and arr[lo] == x:
        return lo
    return -1
```
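The last-occurrence variant from the bullet list follows the same half-open template; only the comparison flips to `<=` and the answer shifts to `lo - 1` (a sketch mirroring `first_occurrence_manual`):

```python
def last_occurrence_manual(arr: list[int], x: int) -> int:
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] <= x:   # <= instead of <: skip past all equal values
            lo = mid + 1
        else:
            hi = mid
    j = lo - 1              # lo lands one past the last occurrence
    if j >= 0 and arr[j] == x:
        return j
    return -1

# last_occurrence_manual([1, 2, 2, 2, 3, 5], 2) returns 3
```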
Common beginner mistakes
- Using `bisect_left` for last occurrence (gives first), or `bisect_right` for first (gives one past last).
- Forgetting the `j > 0` guard before `arr[j - 1]` — negative-index bug when `x < arr[0]`.
- Using `while lo <= hi` instead of `while lo < hi` — infinite loop when `lo == hi`.
- Computing `mid = (lo + hi) / 2` (true division, gives a float) — use `//` (integer division).
- Forgetting that `bisect` requires sorted input — Python won't warn.
Python interview question on first/last position
Given a sorted array arr and a target value x, return the first and last positions of x in arr as [first, last]. If x is absent, return [-1, -1]. Use O(log n) time.
Solution using bisect_left and bisect_right
```python
from bisect import bisect_left, bisect_right

def first_last(arr: list[int], x: int) -> list[int]:
    lo = bisect_left(arr, x)
    hi = bisect_right(arr, x)
    if lo == hi:
        return [-1, -1]
    return [lo, hi - 1]
```
Why this works: bisect_left returns the leftmost insertion point (first position of x if present). bisect_right returns the rightmost insertion point (one past the last position of x). If x isn't present, the two return the same value (no slot exists for it); we detect this with lo == hi and return [-1, -1]. Otherwise [lo, hi - 1] is the inclusive range of x positions. Total: two O(log n) calls plus a constant comparison—net O(log n) time, O(1) extra space.
Tips to crack Atlassian data engineering interviews
These are habits that move the needle in real Atlassian loops—not a re-statement of the topics above.
SQL window-function preparation
Spend most of your SQL prep on window functions—Atlassian's curated set is window-function-heavy: ranking, gaps-and-islands, frame-based moving averages, and continuous time-series. Memorize the four flavors: ROW_NUMBER/RANK/DENSE_RANK for ranking; ROW_NUMBER difference trick for gaps-and-islands; ROWS BETWEEN N PRECEDING AND CURRENT ROW for moving averages; generate_series + LEFT JOIN + LEAD/LAG for continuous time-series. The window functions, gaps-and-islands, and moving-average topic pages cover the spread.
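The frame clause is pure SQL, but its semantics are easy to sanity-check in Python; a sketch of what `ROWS BETWEEN 2 PRECEDING AND CURRENT ROW` computes per row (illustrative only, not the SQL itself):

```python
def moving_avg_3(values: list[float]) -> list[float]:
    # Each output row averages the current value plus up to two preceding rows,
    # mirroring AVG(v) OVER (ORDER BY d ROWS BETWEEN 2 PRECEDING AND CURRENT ROW).
    out = []
    for i in range(len(values)):
        window = values[max(0, i - 2): i + 1]
        out.append(sum(window) / len(window))
    return out

print(moving_avg_3([10, 20, 30, 40]))  # [10.0, 15.0, 20.0, 30.0]
```

Note how the first two rows average a shorter window, exactly as the SQL frame does near a partition's start.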
SQL date function preparation
Drill INTERVAL arithmetic for subscription expiry, DATE_TRUNC for time-bucketing, EXTRACT for component access, and generate_series for calendar tables. Atlassian's prompts dress all of these in B2B SaaS data—subscriptions, page views, daily visitors—but the SQL primitives are the same. The date functions topic page covers the bulk.
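The same interval arithmetic can be prototyped with the stdlib before translating to SQL; the plan names and durations below are hypothetical, not Atlassian's schema:

```python
from datetime import date, timedelta

# Hypothetical plan durations; SQL equivalent: start_date + INTERVAL '30 days'.
PLAN_DAYS = {"monthly": 30, "annual": 365}

def expiry_date(start: date, plan: str) -> date:
    return start + timedelta(days=PLAN_DAYS[plan])

print(expiry_date(date(2024, 1, 1), "monthly"))  # 2024-01-31
```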
Python preparation
Atlassian's Python rounds are minimal: just two patterns in the curated set, the stack pattern (balanced parentheses) and binary search (first/last position). Memorize list.append/list.pop, the pairs-dict idiom, bisect_left/bisect_right, and the manual binary-search template. Drill problems from the stack and binary search topic pages.
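The calls listed above, condensed into a runnable crib sheet (array values reused from the worked example in §8):

```python
from bisect import bisect_left, bisect_right

# Stack idioms
stack: list[str] = []
stack.append("(")               # push
assert stack.pop() == "("       # pop: LIFO, right end only

# Bisect idioms (input must be sorted)
arr = [1, 2, 2, 2, 3, 5]
assert bisect_left(arr, 2) == 1        # first index with arr[i] >= 2
assert bisect_right(arr, 2) == 4       # first index with arr[i] > 2
assert bisect_right(arr, 2) - 1 == 3   # last occurrence of 2
```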
SaaS analytics framing
Atlassian's prompts dress general SQL window functions in B2B SaaS data: subscriptions, page views, daily visitors, attendance counts. The interviewer is grading whether you map the framing to the algorithm correctly. State the mapping out loud: "this is daily ranking, use ROW_NUMBER partitioned by day"; "this is consecutive runs, use the ROW_NUMBER difference trick"; "this is a moving average, use ROWS BETWEEN"; "this is gap-filling, use generate_series + LEFT JOIN."
Where to practice on PipeCode
| Skill lane | Practice path |
|---|---|
| Curated Atlassian practice set | /explore/practice/company/atlassian |
| SQL window functions | /explore/practice/topic/window-functions |
| Gaps and islands | /explore/practice/topic/gaps-and-islands |
| Moving average | /explore/practice/topic/moving-average |
| Time series | /explore/practice/topic/time-series |
| Aggregation | /explore/practice/topic/aggregation |
| Date functions | /explore/practice/topic/date-functions |
| Stack | /explore/practice/topic/stack |
| Binary search | /explore/practice/topic/binary-search |
| All practice topics | /explore/practice/topics |
| Interview courses | /explore/courses |
Communication under time pressure
State assumptions before typing: "I'll assume timestamps are sorted within each partition"; "I'll assume the date range fits in memory for the calendar table"; "I'll assume strict (not loose) gaps in the gaps-and-islands query." State invariants after key code blocks. State complexity: "this is O(N log N) for the sort + O(N) for the streaming aggregate." Interviewers grade clear reasoning above silent-and-perfect.
Frequently Asked Questions
What is the Atlassian data engineering interview process like?
The Atlassian data engineering interview typically includes a phone screen (mostly SQL warm-up around aggregation or window functions), one or two coding rounds focused on SQL window functions and time-series patterns, plus a Python round on stack or binary-search patterns. There's also a system-design conversation around pipelines and SaaS analytics architectures, and behavioral interviews. The curated 8-problem Atlassian practice set on PipeCode mirrors what you will see on the technical rounds.
What SQL topics does Atlassian test for data engineers?
Atlassian emphasizes window functions above all else—ranking with ROW_NUMBER/RANK, gaps-and-islands with the ROW_NUMBER difference trick, frame-based moving averages with ROWS BETWEEN, and continuous time-series with generate_series + LEFT JOIN. Plus aggregation for duplicate detection and date-interval arithmetic for subscription expiry. The window functions, gaps-and-islands, and moving-average topic pages cover the bulk.
How important are window functions in the Atlassian data engineering interview?
Window functions are the make-or-break skill: 5 of the 8 curated problems use OVER (...) clauses, including both Hard problems. Memorize the four flavors covered in §2, §4, §5, and §6 of this guide. A candidate who can't write RANK() OVER (PARTITION BY ... ORDER BY ...) from memory will struggle.
How important is Python for an Atlassian data engineering interview?
Python is a smaller part of the loop—just two patterns in the curated set: stack-based bracket validation and bisect-based first/last position search. Memorize the canonical implementations in §7 and §8. Stdlib fluency (bisect, collections) matters; deep algorithm knowledge less so for this loop.
What's the gaps-and-islands pattern Atlassian asks?
Gaps-and-islands finds runs of consecutive rows where some condition holds. The trick: filter to qualifying rows, assign ROW_NUMBER() OVER (ORDER BY date), then group by date - rn—the difference is constant within a run and changes across runs. The full pattern with worked example is in §4. The gaps-and-islands topic page has problems matching this shape directly.
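The constant-difference arithmetic can be sanity-checked in Python (an illustrative sketch, not the SQL itself): for sorted qualifying day numbers, day minus 0-based position is constant within a consecutive run.

```python
from itertools import groupby

days = [1, 2, 3, 7, 8, 10]  # qualifying days, already sorted

# Group by (day - index): the key is constant within each consecutive run.
runs = [
    [day for _, day in group]
    for _, group in groupby(enumerate(days), key=lambda t: t[1] - t[0])
]
print(runs)  # [[1, 2, 3], [7, 8], [10]]
```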
How many Atlassian practice problems should I solve before the interview?
Aim for 30–50 problems spanning all eight topic clusters above—not 100 of the same kind. Solve every problem in the Atlassian-tagged practice set, then back-fill weak areas using the topic pages linked throughout this guide. Pay extra attention to the four window-function variants; that's where Atlassian distinguishes between candidates.
Start practicing Atlassian data engineering problems
Reading patterns is not the same as typing them under time pressure. PipeCode pairs company-tagged Atlassian problems with tests, AI feedback, and a coding environment so you can drill the exact SQL window-function and Python patterns Atlassian asks—without the noise of generic algorithm prep that doesn't apply to this loop.
Pipecode.ai is Leetcode for Data Engineering.