Oracle data engineering interview questions are a balanced fundamentals + classics tour: three Python primitives (stack-based bracket validation, hash-table two-sum, dict / JSON extraction) and four SQL primitives (aggregation with joins for percentage calculations, self-joins with date functions for roundtrip detection, window-function ranking for top-K per group, and date functions for KPI analysis). The framings span restaurants, flights, coffee shops, and ride-sharing—domain-agnostic interview classics, not Oracle-Database trivia.
This guide walks through the seven topic clusters Oracle actually tests, each with a detailed topic explanation, per-sub-topic explanations with worked examples and solutions, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 7-problem Oracle set (4 easy, 3 medium, 0 hard)—the most fundamentals-friendly company hub covered so far, with no Hard tier and a balanced SQL/Python mix.
Top Oracle data engineering interview topics
From the Oracle data engineering practice set, the seven numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–7) | Why it shows up at Oracle |
|---|---|---|
| 1 | Python stacks for valid brackets | Valid Brackets—push openers, pop on closers, validate the match. |
| 2 | Python hash tables for two-sum | Two Sum—single-pass complement lookup, O(n). |
| 3 | Python dicts for JSON extraction | JSON User Name Extraction—json.loads plus chained dict access. |
| 4 | SQL aggregation with joins for percentage calculations | Percentage of Vegetarian-Only Restaurants—COUNT filtered with CASE WHEN, divide by total. |
| 5 | SQL self-joins with date functions for roundtrip detection | Roundtrip Flights Query—self-join with origin↔dest swap and date filter. |
| 6 | SQL window-function ranking for top-K per group | Coffee Shop Top 3 Products—ROW_NUMBER() partitioned by shop, filter to top 3. |
| 7 | SQL date functions for KPI analysis | Ride-Sharing KPI Analysis—DATE_TRUNC for time bucketing, EXTRACT for duration. |
Domain-agnostic framing rule: Oracle's prompts span four different verticals (food, flights, retail, mobility). The interviewer is grading whether you map each framing to the right algorithm: brackets → stack; pair lookup → hash; JSON → dict navigation; percentage → conditional aggregation; roundtrip → self-join + date filter; top-K → window ranking; KPI → date functions. State the mapping out loud.
1. Python Stacks for Valid Brackets
Stacks for matching brackets in Python for data engineering
Walk a string of brackets; push openers, pop on closers, verify match; return True if balanced. The mental model: a stack tracks "what we expect to close next." Each opener pushes its expectation; each closer pops the most-recent expectation and verifies it matches. A naive count-of-each-bracket fails on crossed inputs like ([)]—the stack pattern catches them.
Pro tip: Use a Python list (`append`/`pop` at the right end) as the stack—both O(1) amortized. `list.pop(0)` is O(n) and silently wrong for LIFO semantics.
Stack basics: list.append / list.pop
A Python list is the canonical LIFO stack. The invariant: the stack at any moment contains the un-closed openers in order; the top is the most recent unclosed opener.
- `stack.append(x)` — push onto the right end, O(1) amortized.
- `stack.pop()` — remove and return the rightmost, O(1).
- `stack[-1]` — peek without popping.
- `if not stack` — empty check before popping (otherwise `IndexError`).
- Worked example: push `(`, `[`, `{` → stack is `['(', '[', '{']`; one `pop()` returns `'{'` and leaves `['(', '[']`.
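The operations above can be run directly in a REPL; here is a minimal sketch of the push/peek/pop cycle from the worked example (a demo, not part of the interview solution):

```python
# A list as a LIFO stack: the right end is the top.
stack: list[str] = []

for ch in "([{":
    stack.append(ch)      # push, O(1) amortized
print(stack)              # ['(', '[', '{']

print(stack[-1])          # peek without popping: '{'
print(stack.pop())        # pop returns '{' and removes it
print(stack)              # ['(', '[']

if not stack:             # always guard before popping
    print("stack is empty")
```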
Push openers, pop on closers, validate match
The full algorithm: define a pairs map from closer to its matching opener, walk the string, push openers, and on each closer pop+verify. The invariant: at any point, the stack reflects the chain of unclosed openers.
- Push when the char is one of `({[`.
- Pop + match when the char is a closer; return `False` if the stack is empty or the popped opener doesn't match.
- End check: non-empty stack means unmatched openers → return `False`.
- Total time O(N), memory O(N) worst case for deeply nested input.
Worked example: `s = "({[]})"`.

| char | action | stack after |
|---|---|---|
| `(` | push | `['(']` |
| `{` | push | `['(', '{']` |
| `[` | push | `['(', '{', '[']` |
| `]` | pop, check `[ == [` ✓ | `['(', '{']` |
| `}` | pop, check `{ == {` ✓ | `['(']` |
| `)` | pop, check `( == (` ✓ | `[]` |
End: stack empty → balanced.
def is_valid(s: str) -> bool:
pairs = {")": "(", "]": "[", "}": "{"}
stack: list[str] = []
for ch in s:
if ch in "({[":
stack.append(ch)
elif ch in ")}]":
if not stack or stack.pop() != pairs[ch]:
return False
return not stack
Pairs dict for multiple bracket types
The pairs dict is the key to generalizing—add new bracket types by adding entries (e.g. "⟩": "⟨"); the algorithm code doesn't change. The invariant: same stack pattern + a configurable pairs dict handles any bracket-like grammar.
- `openers = set(pairs.values())` — separates push-when from pop-when checks.
- `elif ch in pairs:` — dispatches closers via dict membership; non-bracket characters fall through.
- Worked example: the same algorithm with an extended dict handles `<>`, `⟨⟩`, math notation; a one-line dict change adds new bracket types.
def is_valid(s: str) -> bool:
pairs = {")": "(", "]": "[", "}": "{"}
openers = set(pairs.values())
stack: list[str] = []
for ch in s:
if ch in openers:
stack.append(ch)
elif ch in pairs:
if not stack or stack.pop() != pairs[ch]:
return False
return not stack
Common beginner mistakes
- Forgetting to check for an empty stack before popping — `IndexError` on inputs like `)`.
- Using `list.pop(0)` (FIFO) instead of `list.pop()` (LIFO) — wrong stack semantics, also O(n).
- Hardcoding per-bracket branches like `if ch == "(" and stack[-1] == ")"` — doesn't generalize.
- Returning `True` mid-loop on a successful match — you must check end-of-string with an empty stack.
- Counting opens vs closes without a stack — misses crossed brackets like `([)]`.
Python interview question on valid brackets
Implement is_valid(s) that returns True if the input string s has properly balanced and nested (), [], {}. Non-bracket characters are not in the input. Use O(n) time.
Solution using a stack and a pairs dict
def is_valid(s: str) -> bool:
pairs = {")": "(", "]": "[", "}": "{"}
openers = set(pairs.values())
stack: list[str] = []
for ch in s:
if ch in openers:
stack.append(ch)
elif ch in pairs:
if not stack or stack.pop() != pairs[ch]:
return False
return not stack
Step-by-step trace (input: s = "([{}])"):
| step | char | action | stack after |
|---|---|---|---|
| 0 | — | initial | `[]` |
| 1 | `(` | opener — push | `['(']` |
| 2 | `[` | opener — push | `['(', '[']` |
| 3 | `{` | opener — push | `['(', '[', '{']` |
| 4 | `}` | closer — pop `{` matches `pairs['}'] = '{'` | `['(', '[']` |
| 5 | `]` | closer — pop `[` matches `pairs[']'] = '['` | `['(']` |
| 6 | `)` | closer — pop `(` matches `pairs[')'] = '('` | `[]` |
- Initialize — `pairs` maps each closer to its opener; `openers = {'(', '[', '{'}`; `stack = []`.
- Iterate — walk `s` once, classifying each char as opener, closer, or other.
- Push opener — at indexes 0–2, push `(`, `[`, `{` onto the stack.
- Pop and check on closer — at index 3 (`}`), pop `{`; compare to `pairs['}'] = '{'` — match, continue.
- Continue popping — index 4 pops `[` matching `pairs[']']`; index 5 pops `(` matching `pairs[')']`.
- Final check — loop exits, `stack` is empty; `return not stack` evaluates `not [] = True`.
- Mismatch case (e.g. `([)]`) — at index 2 (`)`), pop returns `[`; `pairs[')'] = '('`, mismatch → return `False` immediately.
Output:

| input | result |
|---|---|
| `"([{}])"` | `True` |
| `"([)]"` | `False` |
| `"(("` | `False` |
| `")"` | `False` |
Why this works — concept by concept:
- Stack as LIFO history — pushing every opener captures the nesting order; the most recently opened bracket is the only one a closer can legally close.
- Pairs dict as O(1) lookup — `pairs[closer]` returns the matching opener in constant time, no branching on bracket type.
- Empty-stack guard — `not stack` short-circuits before `stack.pop()`, catching "closer without opener" (e.g. `)`) without an `IndexError`.
- Mismatch guard — `stack.pop() != pairs[ch]` catches crossed brackets (e.g. `([)]`) where a closer doesn't match the most-recent opener.
- Final emptiness check — `return not stack` catches unclosed openers (e.g. `((`); a non-empty stack at end means dangling openers.
- Cost — one pass, O(n) time; stack worst case O(n) memory for fully nested input like `(((...)))`.
2. Python Hash Tables for Two-Sum
Hash tables for pair-lookup in Python for data engineering
Given an array of integers and a target sum, return the indices of the two values that add to the target. The naive answer is two nested loops (O(n²)); the clean answer is a single-pass hash-table complement lookup (O(n) time, O(n) space). The mental model: for each element x, the answer is at the index of target - x; build seen = {value: index} as you walk and check the complement before storing.
Two-sum brute force vs hash-table O(n)
Brute force checks every pair (i, j) with i < j for arr[i] + arr[j] == target—O(n²), intractable for n = 10^6. The hash rewrite walks once, checks target - x against a running seen dict, and trades O(n) space for the speedup.
- Brute force: nested loops, O(n²) time, O(1) space.
- Hash single-pass: O(n) time, O(n) space; check before store so `seen` only contains earlier indices.
- Worked example: `arr = [2, 7, 11, 15]`, `target = 9`.

| i | x | complement | seen before | found? | seen after |
|---|---|---|---|---|---|
| 0 | 2 | 7 | `{}` | no | `{2: 0}` |
| 1 | 7 | 2 | `{2: 0}` | yes → return `[0, 1]` | — |
def two_sum(nums: list[int], target: int) -> list[int]:
seen: dict[int, int] = {}
for i, x in enumerate(nums):
complement = target - x
if complement in seen:
return [seen[complement], i]
seen[x] = i
return []
Single-pass complement lookup
The single-pass form is structural: check before store. Reversing the order would let x match itself (the complement could equal x, and seen[x] would point at the same index). The invariant: seen always contains earlier indices; checking complement before storing guarantees we don't match an index against itself.
- Check first — `if complement in seen: return [...]`.
- Then store — `seen[x] = i` after the check.
- Worked example: `arr = [3, 3]`, `target = 6` → at `i=0` no match, store `{3: 0}`; at `i=1` complement `3` is in `seen` → return `[0, 1]`. Storing first would let `3` match its own index and return `[0, 0]`.
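The duplicate case is easy to verify by contrasting the two orderings side by side; the store-first variant below is deliberately buggy, written only to show the failure mode:

```python
def two_sum(nums: list[int], target: int) -> list[int]:
    """Correct version: check the complement, then store."""
    seen: dict[int, int] = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []


def two_sum_store_first(nums: list[int], target: int) -> list[int]:
    """Deliberately buggy: storing before checking lets x match itself."""
    seen: dict[int, int] = {}
    for i, x in enumerate(nums):
        seen[x] = i                    # store first -- the bug
        if target - x in seen:
            return [seen[target - x], i]
    return []


print(two_sum([3, 3], 6))              # [0, 1] -- the earlier 3 pairs with the later 3
print(two_sum_store_first([3, 3], 6))  # [0, 0] -- index 0 "pairs" with itself
```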
Two-pointer alternative on sorted input
When the input is already sorted, place left = 0, right = n - 1 and step inward by comparing arr[left] + arr[right] to target. O(n) time, O(1) space—beats hash on memory but loses if you have to sort first (O(n log n)).
- Unsorted input — hash-table single-pass (O(n) time, O(n) space).
- Sorted input — two-pointer (O(n) time, O(1) space).
- Index caveat: two-pointer returns indices into the sorted array, which may not match original positions if you sorted in place.
- Worked example: `arr = [2, 7, 11, 15]`, `target = 9`.

| step | left | right | sum | action |
|---|---|---|---|---|
| 1 | 0 | 3 | 17 | sum > target, decrement right |
| 2 | 0 | 2 | 13 | sum > target, decrement right |
| 3 | 0 | 1 | 9 | sum == target → return `[0, 1]` |
def two_sum_sorted(nums: list[int], target: int) -> list[int]:
left, right = 0, len(nums) - 1
while left < right:
s = nums[left] + nums[right]
if s == target:
return [left, right]
elif s < target:
left += 1
else:
right -= 1
return []
Common beginner mistakes
- Reaching for nested loops (O(n²)) when the hash-table single pass is O(n).
- Using `list.index()` to find the complement — O(n) per lookup, ruining the hash speedup.
- Storing before checking — breaks for duplicate-value cases.
- Two-pointer on unsorted input — silently wrong.
- Returning the values instead of the indices when the prompt asks for indices.
Python interview question on two-sum
Given an array of integers nums and a target integer target, return the indices of the two numbers that add to target. Each input has exactly one solution; you may not use the same element twice. Use O(n) time.
Solution using a hash-table single-pass complement lookup
def two_sum(nums: list[int], target: int) -> list[int]:
seen: dict[int, int] = {}
for i, x in enumerate(nums):
complement = target - x
if complement in seen:
return [seen[complement], i]
seen[x] = i
return []
Step-by-step trace (input: nums = [2, 7, 11, 15], target = 9):
| i | x | complement | seen before | check | action | seen after |
|---|---|---|---|---|---|---|
| 0 | 2 | 7 | `{}` | miss | store `seen[2] = 0` | `{2: 0}` |
| 1 | 7 | 2 | `{2: 0}` | hit | return `[seen[2], 1] = [0, 1]` | — |
- Initialize — `seen = {}` (empty dict; will hold `value: index` pairs).
- `i = 0, x = 2` — compute `complement = 9 - 2 = 7`; `7 not in seen`; store `seen[2] = 0`.
- `i = 1, x = 7` — compute `complement = 9 - 7 = 2`; `2 in seen` → hit.
- Return — `[seen[2], 1]` = `[0, 1]`. Function exits before storing `seen[7]`.
- Self-pair safety — the check happens before the store, so when `x = 7` is processed, `seen` does not yet contain index 1 → it can't match itself.
- No-solution path — if the loop exits without a hit (no two values sum to target), the trailing `return []` fires.
Output:

| input | result |
|---|---|
| `nums = [2, 7, 11, 15], target = 9` | `[0, 1]` |
| `nums = [3, 2, 4], target = 6` | `[1, 2]` |
| `nums = [1, 2, 3], target = 100` | `[]` |
Why this works — concept by concept:
- Complement reframing — instead of "find any pair `(i, j)` with `nums[i] + nums[j] == target`," reframe as "for each `x`, is `target - x` already seen?" — turns the O(n²) pair search into an O(n) lookup.
- Hash map for O(1) membership — `seen` is a dict keyed by value; Python's average-case O(1) `in` and `[]` give constant-time complement lookup.
- Check-before-store ordering — `if complement in seen` runs before `seen[x] = i`, so an element can't match itself in the same iteration.
- Index ordering — when a hit fires, `seen[complement]` is the earlier index and `i` is the current index; returning `[seen[complement], i]` produces the smaller-first ordering most prompts request.
- Single-pass guarantee — exactly one walk of `nums`; no nested loop, no second scan.
- Cost — O(n) time (one pass × average O(1) hash ops); O(n) memory worst case for the dict.
3. Python Dicts for JSON Extraction
JSON parsing and dict navigation in Python for data engineering
Real DE pipelines consume JSON every day: API responses, event-stream payloads, config files, NoSQL documents. The mental model: json.loads converts a JSON string into a nested Python dict/list; navigation is plain Python access. The trap: data["user"]["profile"]["name"] raises KeyError at the first missing level—use .get() chaining or try/except for robustness.
json.loads for string-to-dict parsing
json.loads(s) parses a JSON string into Python objects; json.load(f) reads from a file. The reverse pair is json.dumps / json.dump. Errors raise json.JSONDecodeError (a subclass of ValueError).
- JSON object `{...}` → Python `dict`.
- JSON array `[...]` → Python `list`.
- JSON `null` / `true` / `false` → Python `None` / `True` / `False`.
- Double quotes only — `json.loads("{'name': 'Alice'}")` raises; use `ast.literal_eval` for Python-style strings.
- Worked example: parse `{"user": {"name": "Alice", "tags": ["admin"]}}`; access `data["user"]["name"]` → `"Alice"`, `data["user"]["tags"][0]` → `"admin"`.
import json
raw = '{"user": {"name": "Alice", "age": 30, "tags": ["admin", "active"]}}'
data = json.loads(raw)
print(data["user"]["name"]) # Alice
print(data["user"]["tags"][0]) # admin
Nested-key access with .get chaining and defaults
dict["key"] raises KeyError on missing keys; dict.get("key") returns None. For nested paths, chain .get(key, {}) so each intermediate miss returns an empty dict that the next .get can safely call. The invariant: .get chaining never raises on missing keys.
- `data.get("user", {}).get("profile", {}).get("name")` — safe nested access; returns `None` on any missing level.
- Helper for the project-wide pattern: a `deep_get(d, *keys, default=None)` walker.
- `try/except KeyError` — cleaner when missing-key cases are rare and need different defaults.
- Worked example: navigate `data["user"]["profile"]["name"]` when `profile` may be missing.

| structure | `data["user"]["profile"]["name"]` | `.get` chain |
|---|---|---|
| `{"user": {"profile": {"name": "Alice"}}}` | `"Alice"` | `"Alice"` |
| `{"user": {}}` | `KeyError` | `None` |
| `{}` | `KeyError` | `None` |
def get_name(data: dict) -> str | None:
return data.get("user", {}).get("profile", {}).get("name")
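The `deep_get` walker mentioned above might look like the sketch below; the name and signature follow the convention stated in the text, not any stdlib function:

```python
def deep_get(d: dict, *keys, default=None):
    """Walk nested dicts key by key; return default on any missing level."""
    current = d
    for key in keys:
        # Stop if the current level isn't a dict or lacks the key.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


data = {"user": {"profile": {"name": "Alice"}}}
print(deep_get(data, "user", "profile", "name"))          # Alice
print(deep_get({"user": {}}, "user", "profile", "name"))  # None
print(deep_get({}, "user", default="missing"))            # missing
```

Unlike the `.get(key, {})` chain, this version also survives intermediate values that are not dicts (e.g. `{"user": None}`), since it type-checks each level before descending.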
List-comprehension extraction over arrays of records
When the JSON contains an array of records, extract a field per record via a comprehension with safe defaults. Use .get("array", []) so a missing top-level key yields an empty iteration.
- `[u.get("name", "<unknown>") for u in data.get("users", [])]` — the canonical safe extract.
- Readability over terseness — name a helper for nested or filtered comprehensions; misread one-liners corrupt downstream data.
- Worked example: input `{"users": [{"name": "Alice", "id": 1}, {"name": "Bob", "id": 2}, {"id": 3}]}` → `["Alice", "Bob", "<unknown>"]`.
import json
def extract_names(raw: str) -> list[str]:
data = json.loads(raw)
return [u.get("name", "<unknown>") for u in data.get("users", [])]
Common beginner mistakes
- Using `data["key"]` for paths that may not exist — `KeyError` crashes the function.
- Forgetting that JSON requires double quotes — single-quoted "JSON" strings raise.
- Calling `.get` then `[i]` — `.get` returns `None`, and `None[0]` raises `TypeError`.
- Confusing `json.loads` (string → dict) and `json.load` (file → dict).
- Mutating the parsed dict and writing back without `json.dumps` — the original raw string is unchanged.
Python interview question on JSON extraction
Given a JSON string representing a list of user records under the key "users", each with optional fields "name", "email", "id", return a list of (id, name) tuples for users that have both fields. Use O(n) time.
Solution using json.loads and a list comprehension
import json
def extract_users(raw: str) -> list[tuple[int, str]]:
data = json.loads(raw)
return [
(u["id"], u["name"])
for u in data.get("users", [])
if "id" in u and "name" in u
]
Step-by-step trace (input: raw = '{"users": [{"id": 1, "name": "Alice"}, {"id": 2}, {"name": "Bob", "id": 3}]}'):
| u | `"id" in u` | `"name" in u` | both? | tuple emitted |
|---|---|---|---|---|
| `{"id": 1, "name": "Alice"}` | yes | yes | yes | `(1, "Alice")` |
| `{"id": 2}` | yes | no | no | — |
| `{"name": "Bob", "id": 3}` | yes | yes | yes | `(3, "Bob")` |
- Parse JSON — `json.loads(raw)` deserializes the JSON string into a Python `dict`: `{"users": [...]}`.
- Safe outer access — `data.get("users", [])` returns the records list; if the `"users"` key were missing entirely, it returns `[]` and the comprehension iterates zero times instead of raising `KeyError`.
- Filter records — for each user dict `u`, the `if` clause `"id" in u and "name" in u` keeps only records that have both required fields.
- Emit tuple — for kept records, `(u["id"], u["name"])` builds the result tuple.
- Skip incomplete row — `{"id": 2}` lacks `"name"`, the filter is `False`, no tuple emitted.
- Final list — the comprehension collects two tuples: `[(1, "Alice"), (3, "Bob")]`.
Output:
| id | name |
|---|---|
| 1 | Alice |
| 3 | Bob |
Why this works — concept by concept:
- `json.loads` deserialization — converts a JSON string into native Python types (dict/list/str/int/bool/None); raises `JSONDecodeError` only on malformed JSON.
- `.get("users", [])` safe default — replaces `data["users"]` (which raises `KeyError` on a missing key) with an empty list; the comprehension iterates zero times and returns `[]` cleanly.
- Membership filter `"id" in u and "name" in u` — short-circuits on missing keys before the `(u["id"], u["name"])` access; protects against `KeyError` inside the value expression.
- Tuple in comprehension — `(u["id"], u["name"])` constructs an immutable record per user; tuples are hashable and ordered, ideal for return values.
- Single-pass list comprehension — one walk over the records list, allocating one tuple per kept record; no intermediate state.
- Cost — O(n) for parsing the JSON plus one pass over the records; O(n) memory for the result list.
4. SQL Aggregation with Joins for Percentage Calculations
Aggregation with joins for percentage calculations in SQL for data engineering
Percentage-of-total calculations are everywhere in DE work: what percentage of users converted, what percentage of restaurants are vegetarian. The mental model: percentage = 100.0 * filtered_count / total_count, computed as conditional aggregation in a single query—numerator via SUM(CASE WHEN ... THEN 1 ELSE 0 END), denominator via COUNT(*).
Pro tip: Multiply by `100.0` (a float literal) to force float division—`2 / 3 = 0` in PostgreSQL with integer operands. Wrap the denominator in `NULLIF(..., 0)` to avoid divide-by-zero.
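Both traps are easy to reproduce locally. The quick check below runs through Python's stdlib sqlite3; SQLite happens to share the integer-division behavior described here for PostgreSQL, so it works as a stand-in (a demo, not a dialect guarantee):

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# Integer operands -> integer division: the fraction is truncated to 0.
print(cur.execute("SELECT 2 / 3").fetchone()[0])             # 0

# One float literal anywhere in the expression forces float division.
print(cur.execute("SELECT 100.0 * 2 / 3").fetchone()[0])     # 66.666...

# NULLIF turns a zero denominator into NULL; division by NULL yields NULL.
print(cur.execute("SELECT 1 / NULLIF(0, 0)").fetchone()[0])  # None
```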
COUNT filtered with CASE WHEN for conditional aggregation
SQL aggregates skip NULL, so the canonical "count rows matching predicate" idiom is SUM(CASE WHEN predicate THEN 1 ELSE 0 END) or equivalently COUNT(CASE WHEN predicate THEN 1 END). The invariant: conditional aggregation lets you compute multiple per-group metrics in a single pass without separate queries.
- `SUM(CASE WHEN ... THEN 1 ELSE 0 END)` — portable across all dialects.
- `COUNT(*) FILTER (WHERE predicate)` — cleaner SQL-standard form (PostgreSQL).
- Worked example: three restaurants, `is_vegetarian = TRUE, FALSE, TRUE`.

| metric | value |
|---|---|
| `COUNT(*)` | 3 |
| `SUM(CASE WHEN is_vegetarian THEN 1 ELSE 0 END)` | 2 |
| `COUNT(*) FILTER (WHERE is_vegetarian)` | 2 |
SELECT
COUNT(*) AS total,
SUM(CASE WHEN is_vegetarian THEN 1 ELSE 0 END) AS veg_count
FROM restaurants;
Percentage = 100.0 * filtered_count / total_count
The percentage formula combines the two aggregates above. Two traps: integer division (force float with 100.0) and divide-by-zero (wrap denominator in NULLIF(COUNT(*), 0)). The invariant: percentage = 100.0 * matching_count / total_count with a NULLIF guard.
- `100.0 * SUM(...) / COUNT(*)` — the float literal propagates float division.
- `NULLIF(COUNT(*), 0)` — converts zero to NULL so the division returns NULL silently.
- `ROUND(..., 2)` — clamps display precision and removes float noise.
- Worked example: three restaurants, two vegetarian → `100.0 * 2 / 3 = 66.67%`.
SELECT
ROUND(
100.0 * SUM(CASE WHEN is_vegetarian THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0),
2
) AS veg_percentage
FROM restaurants;
JOIN to a dimension before aggregating
Real percentage queries pull attributes from dimension tables: fact has restaurant_id; dim has is_vegetarian, cuisine, city. Join fact to dimension first, then aggregate. The invariant: the grain of the result is determined by the GROUP BY clause, not by the join cardinality—but watch fan-out from many-rows-per-key dimensions.
- Three-way join — fact + dim(city) + dim(dietary), then `GROUP BY c.city`.
- Fan-out defense — pre-aggregate the dim to one row per key, or use `COUNT(DISTINCT restaurant_id)`.
- Worked example: two cities, three restaurants; per-city percentage after the join + group.
| city | total | veg_count | veg_percentage |
|---|---|---|---|
| Denver | 2 | 1 | 50.00 |
| Seattle | 1 | 1 | 100.00 |
SELECT
c.city,
ROUND(
100.0 * SUM(CASE WHEN d.is_vegetarian THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0),
2
) AS veg_percentage
FROM restaurants r
JOIN cities c ON c.city_id = r.city_id
JOIN dietary d ON d.restaurant_id = r.restaurant_id
GROUP BY c.city
ORDER BY veg_percentage DESC;
Common beginner mistakes
- Integer division: `2 / 3 = 0` instead of 0.6667. Multiply by `100.0` (a float) to fix.
- Forgetting `NULLIF(denominator, 0)` — division by zero crashes the query.
- Using `COUNT(*)` after a fan-out join — overcounts; use `COUNT(DISTINCT ...)` instead.
- Computing the numerator and denominator in two separate queries — slower, more verbose.
- Not casting the result to a fixed precision — percentages display with float noise (e.g. 66.66666666666667).
SQL interview question on percentage with join
Tables restaurants(restaurant_id, city_id), cities(city_id, city), dietary(restaurant_id, is_vegetarian BOOLEAN). Return per-city percentage of vegetarian-only restaurants. Round to 2 decimals; sort by percentage descending.
Solution using conditional aggregation, NULLIF guard, and three-way join
SELECT
c.city,
ROUND(
100.0 * SUM(CASE WHEN d.is_vegetarian THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0),
2
) AS veg_percentage
FROM restaurants r
JOIN cities c ON c.city_id = r.city_id
JOIN dietary d ON d.restaurant_id = r.restaurant_id
GROUP BY c.city
ORDER BY veg_percentage DESC, c.city ASC;
Step-by-step trace (input: 3 restaurants across 2 cities):
restaurants(restaurant_id, city_id):
| restaurant_id | city_id |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
cities(city_id, city):
| city_id | city |
|---|---|
| 1 | Denver |
| 2 | Seattle |
dietary(restaurant_id, is_vegetarian):
| restaurant_id | is_vegetarian |
|---|---|
| 1 | TRUE |
| 2 | FALSE |
| 3 | TRUE |
- Inner three-way join — `restaurants r` × `cities c` on `city_id` × `dietary d` on `restaurant_id`. Result: 3 joined rows, each with a city name + `is_vegetarian` flag.
- `GROUP BY c.city` — collapses to one row per city. Denver has 2 grouped rows; Seattle has 1.
- Per-group `COUNT(*)` — Denver: 2; Seattle: 1.
- Per-group conditional sum — `SUM(CASE WHEN d.is_vegetarian THEN 1 ELSE 0 END)`. Denver: TRUE + FALSE → `1`. Seattle: TRUE → `1`.
- Percentage formula per group — Denver: `100.0 * 1 / NULLIF(2, 0) = 50.0`. Seattle: `100.0 * 1 / NULLIF(1, 0) = 100.0`.
- `ROUND(..., 2)` — Denver → `50.00`; Seattle → `100.00`.
- `ORDER BY veg_percentage DESC, c.city ASC` — Seattle (100) first, Denver (50) second; the secondary key resolves any ties.
Output:
| city | veg_percentage |
|---|---|
| Seattle | 100.00 |
| Denver | 50.00 |
Why this works — concept by concept:
- Three-way join attaches dimensions — `r ⨝ c` provides the city label, `r ⨝ d` provides the vegetarian flag; the fact table `restaurants` defines the grain (one row per restaurant) before the group.
- Conditional aggregation as numerator — `SUM(CASE WHEN d.is_vegetarian THEN 1 ELSE 0 END)` counts only vegetarian rows in a single pass alongside the denominator — no separate query.
- `COUNT(*)` as total denominator — counts all joined rows in the group; pairs naturally with the numerator over the same group.
- Float-literal multiplier — `100.0 *` (rather than `100 *`) forces float division so `1 / 2` evaluates as `0.5`, not integer `0`.
- `NULLIF(COUNT(*), 0)` divide-by-zero guard — converts a zero-row group into `NULL`, which makes the division yield `NULL` instead of crashing.
- `ROUND(..., 2)` precision clamp — strips float noise (e.g. `66.66666666666667`) so the result reads `66.67`.
- Deterministic ordering — `ORDER BY veg_percentage DESC, c.city ASC` sorts top-down, with the city name as a stable tie-breaker.
- Cost — one scan per table, hash join on the small dimensions, hash aggregate by city; net O(N) where `N = |restaurants|`.
5. SQL Self-Joins with Date Functions for Roundtrip Detection
Self-joins with date filters in SQL for data engineering
Roundtrip detection is the canonical self-join interview pattern with a date twist. The mental model: a roundtrip = two flights where the second's origin and destination are swapped versus the first, and the second's date is within the return window. Self-join flights to itself with two aliases (outbound o, inbound i), join on o.dest = i.origin AND o.origin = i.dest, and filter on i.date within N days of o.date.
Self-join basics: alias the same table twice
A self-join uses one physical table as two logical references in the same query. Aliases are required to disambiguate columns; the ON clause defines the relationship. The invariant: aliasing makes intent explicit—state the relationship out loud: "outbound's destination equals inbound's origin and vice versa."
- `flights AS o`, `flights AS i` — two logical aliases on one physical table.
- The `ON` clause carries all relationship constraints; aliases alone do nothing.
- Worked example: two flights `(101, SFO→JFK)` and `(102, JFK→SFO)`.
| flight_id | origin | dest | date |
|---|---|---|---|
| 101 | SFO | JFK | 2026-04-01 |
| 102 | JFK | SFO | 2026-04-05 |
Self-join with o.dest = i.origin AND o.origin = i.dest returns the pair (101, 102).
SELECT o.flight_id AS outbound, i.flight_id AS inbound
FROM flights o
JOIN flights i ON o.dest = i.origin AND o.origin = i.dest;
Roundtrip pattern: origin↔dest swap
The roundtrip-pattern join condition is the swap: o.dest = i.origin AND o.origin = i.dest. Add o.flight_id != i.flight_id to defend against degenerate self-pairings. The invariant: roundtrip = origin/destination swap on the second flight with a self-pair guard.
- Two equality constraints, AND-joined — both directions must match.
- `o.flight_id != i.flight_id` — prevents pairing a row with itself (e.g. a circular tour).
- Multi-leg trips (A→B→C, C→B→A) require recursive CTEs — out of scope for the simple two-alias form.
- Worked example: same two flights; the swap condition matches; result = 1 pair.
SELECT o.flight_id AS outbound, i.flight_id AS inbound
FROM flights o
JOIN flights i
ON o.dest = i.origin
AND o.origin = i.dest
AND o.flight_id != i.flight_id;
Date filter with INTERVAL for return within N days
A real roundtrip has a date constraint: the inbound flight returns within N days of the outbound. PostgreSQL interval arithmetic (o.date + INTERVAL '7 days') makes this a one-liner; the choice of BETWEEN vs half-open >= / < controls boundary semantics. The invariant: interval arithmetic + date filter constrains the return window.
- `BETWEEN` is inclusive on both ends — use it only if both ends are intended.
- Half-open `>= / <` — safer default for "within 7 days" semantics; matches partition arithmetic elsewhere.
- Timezone caveat — `INTERVAL '7 days'` on `TIMESTAMP WITH TIME ZONE` is `7 * 86400` seconds, not always exactly 7 calendar days under DST. Pure `DATE` columns are safe.
- Worked example: flights 101 (2026-04-01) and 102 (2026-04-05) → 4 days apart, inside a 7-day window → pair `(101, 102)` returned.
SELECT o.flight_id AS outbound, i.flight_id AS inbound, o.date AS depart_date, i.date AS return_date
FROM flights o
JOIN flights i
ON o.dest = i.origin
AND o.origin = i.dest
AND o.flight_id != i.flight_id
WHERE i.date >= o.date
AND i.date < o.date + INTERVAL '7 days';
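For a quick local sanity check, the same query can be run through Python's stdlib sqlite3. SQLite has no `INTERVAL`, so `date(o.date, '+7 days')` stands in for the PostgreSQL arithmetic; this is a dialect translation, assuming ISO-8601 date strings (which compare correctly as text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (flight_id INTEGER, origin TEXT, dest TEXT, date TEXT)"
)
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?, ?)",
    [
        (101, "SFO", "JFK", "2026-04-01"),
        (102, "JFK", "SFO", "2026-04-05"),
        (103, "LAX", "SEA", "2026-04-02"),
        (104, "SEA", "LAX", "2026-04-15"),
    ],
)

rows = conn.execute("""
    SELECT o.flight_id AS outbound, i.flight_id AS inbound
    FROM flights o
    JOIN flights i
      ON o.dest = i.origin
     AND o.origin = i.dest
     AND o.flight_id != i.flight_id
    WHERE i.date >= o.date
      AND i.date < date(o.date, '+7 days')   -- SQLite's interval arithmetic
    ORDER BY o.flight_id
""").fetchall()
print(rows)  # [(101, 102)] -- the 103/104 pair returns 13 days out, so it is dropped
```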
Common beginner mistakes
- Forgetting to alias — `SELECT origin FROM flights JOIN flights ON ...` is ambiguous.
- Joining with `o.origin = i.origin AND o.dest = i.dest` (same direction) — gives same-leg pairs, not roundtrips.
- Forgetting `o.flight_id != i.flight_id` and pairing flights with themselves.
- Using `BETWEEN` when half-open `>= / <` is the correct convention — off-by-one boundary inclusion.
- Computing date arithmetic in the application layer (Python), missing the SQL `INTERVAL` shortcut.
SQL interview question on roundtrip flights
Table flights(flight_id, origin, dest, date). Return all (outbound, inbound) flight-id pairs where the inbound trip is the reverse of the outbound and returns within 7 days. Sort by outbound ascending.
Solution using a self-join with origin↔dest swap and date filter
SELECT
o.flight_id AS outbound,
i.flight_id AS inbound,
o.date AS depart_date,
i.date AS return_date
FROM flights o
JOIN flights i
ON o.dest = i.origin
AND o.origin = i.dest
AND o.flight_id != i.flight_id
WHERE i.date >= o.date
AND i.date < o.date + INTERVAL '7 days'
ORDER BY o.flight_id;
Step-by-step trace (input: flights table below):
| flight_id | origin | dest | date |
|---|---|---|---|
| 101 | SFO | JFK | 2026-04-01 |
| 102 | JFK | SFO | 2026-04-05 |
| 103 | LAX | SEA | 2026-04-02 |
| 104 | SEA | LAX | 2026-04-15 |
- Alias the table twice — `flights o` (outbound) and `flights i` (inbound) reference the same physical rows under two logical names.
- Swap-condition match — for each `(o, i)` candidate, require `o.dest = i.origin` AND `o.origin = i.dest`. Pair `(101, 102)`: `o.dest = JFK = i.origin` and `o.origin = SFO = i.dest` → match. Pair `(103, 104)`: `o.dest = SEA = i.origin` and `o.origin = LAX = i.dest` → match.
- Self-pair guard — `o.flight_id != i.flight_id` ensures we don't pair flight 101 with itself; both candidate pairs have distinct ids → keep.
- Date filter, pair `(101, 102)` — `i.date = 2026-04-05 >= o.date = 2026-04-01` and `2026-04-05 < 2026-04-01 + 7 days = 2026-04-08` → both true, keep.
- Date filter, pair `(103, 104)` — `i.date = 2026-04-15 >= o.date = 2026-04-02` is true, but `2026-04-15 < 2026-04-02 + 7 days = 2026-04-09` is false → drop.
- `ORDER BY o.flight_id` — sorts the surviving pair(s) by outbound id ascending; `(101, 102)` is the only result.
Output:
| outbound | inbound | depart_date | return_date |
|---|---|---|---|
| 101 | 102 | 2026-04-01 | 2026-04-05 |
Why this works — concept by concept:
- Self-join with two aliases — `flights o` and `flights i` give the same physical table two logical roles; aliases are required to disambiguate columns and to express the relationship.
- Origin↔destination swap — the equality pair `o.dest = i.origin AND o.origin = i.dest` encodes "the inbound flight is the reverse of the outbound" — purely structural, no domain knowledge needed.
- Self-pair guard — `o.flight_id != i.flight_id` blocks degenerate pairings of a row with itself (which would otherwise match for circular routes where `origin = dest`).
- Half-open date window — `i.date >= o.date AND i.date < o.date + INTERVAL '7 days'` matches the canonical "within 7 days" semantics — inclusive lower bound, exclusive upper bound — safe from off-by-one boundary bugs.
- `INTERVAL` arithmetic — keeps the date math in SQL rather than the application layer; one expression that is both portable and readable.
- Deterministic ordering — `ORDER BY o.flight_id` produces stable output, important for tests and downstream diffs.
- Cost — index-supported hash join on `(origin, dest)` plus a filter pass; roughly O(N + M), where N and M are the outbound and inbound candidate counts.
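As a sanity check, the whole query can be run end to end with Python's built-in sqlite3 module. A sketch, with one translation assumption: dates are stored as ISO-8601 text, and SQLite's `date(..., '+7 days')` modifier stands in for the Postgres `INTERVAL` arithmetic.

```python
import sqlite3

# In-memory table mirroring the flights sample from the trace.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE flights (flight_id INTEGER, origin TEXT, dest TEXT, date TEXT);
INSERT INTO flights VALUES
  (101, 'SFO', 'JFK', '2026-04-01'),
  (102, 'JFK', 'SFO', '2026-04-05'),
  (103, 'LAX', 'SEA', '2026-04-02'),
  (104, 'SEA', 'LAX', '2026-04-15');
""")

# Same self-join; date(o.date, '+7 days') replaces o.date + INTERVAL '7 days'.
rows = con.execute("""
SELECT o.flight_id, i.flight_id, o.date, i.date
FROM flights o
JOIN flights i
  ON o.dest = i.origin
 AND o.origin = i.dest
 AND o.flight_id != i.flight_id
WHERE i.date >= o.date
  AND i.date < date(o.date, '+7 days')
ORDER BY o.flight_id;
""").fetchall()

print(rows)  # [(101, 102, '2026-04-01', '2026-04-05')]
```

Note that the reverse candidate `(102, 101)` is eliminated by the date filter, not the swap condition — its "return" would predate its "departure."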
6. SQL Window-Function Ranking for Top-K per Group
Top-K-per-group with window ranking in SQL for data engineering
"For each group, return the top-K rows by some metric" is the most-asked SQL window-function pattern. The mental model: PARTITION BY resets the ranking at each group; ORDER BY metric DESC ranks within group top-first; CTE wrapper + WHERE rn <= K keeps the top K. Three independent pieces, one window expression.
ROW_NUMBER() vs RANK() for top-K
Three ranking functions exist; the choice depends on tie semantics. The decision rule: ROW_NUMBER for "exactly K rows," RANK for "tied for first," DENSE_RANK for "no gaps". Default to ROW_NUMBER with a deterministic tie-break key (e.g. product_id ASC).
- `ROW_NUMBER()` — sequential `1, 2, 3, ...` even on ties; deterministic with a tie-break.
- `RANK()` — ties share rank, next rank skips (`1, 1, 3`).
- `DENSE_RANK()` — ties share rank, no gap (`1, 1, 2`).
- Worked example: shop A has products X (100), Y (80), Z (80), W (50). `ROW_NUMBER` returns X, Y, Z for top-3; `RANK` also returns X, Y, Z (Y and Z tied at rank 2).
| shop | product | units_sold | row_number | rank |
|---|---|---|---|---|
| A | X | 100 | 1 | 1 |
| A | Y | 80 | 2 | 2 |
| A | Z | 80 | 3 | 2 |
| A | W | 50 | 4 | 4 |
SELECT
shop_id, product_id, units_sold,
ROW_NUMBER() OVER (PARTITION BY shop_id ORDER BY units_sold DESC, product_id ASC) AS rn,
RANK() OVER (PARTITION BY shop_id ORDER BY units_sold DESC) AS rnk
FROM sales;
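The tie semantics above are easy to verify with Python's sqlite3 module, which ships window functions as of SQLite 3.25. A minimal sketch over the shop-A sample:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (shop_id TEXT, product_id TEXT, units_sold INTEGER);
INSERT INTO sales VALUES ('A','X',100), ('A','Y',80), ('A','Z',80), ('A','W',50);
""")

# rn uses the deterministic tie-break; rnk deliberately omits it to expose RANK's tie/skip behavior.
rows = con.execute("""
SELECT product_id,
       ROW_NUMBER() OVER (PARTITION BY shop_id
                          ORDER BY units_sold DESC, product_id ASC) AS rn,
       RANK()       OVER (PARTITION BY shop_id
                          ORDER BY units_sold DESC) AS rnk
FROM sales
ORDER BY rn;
""").fetchall()

print(rows)  # [('X', 1, 1), ('Y', 2, 2), ('Z', 3, 2), ('W', 4, 4)]
```

Y and Z both carry rank 2, and W jumps to 4 — exactly the `1, 2, 2, 4` skip the table illustrates.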
PARTITION BY shop_id ORDER BY units_sold DESC
PARTITION BY is the per-group reset; each group gets its own ranking sequence starting at 1. ORDER BY units_sold DESC sorts within group, top-first. Always add a deterministic tie-break key—interviewers grade whether you noticed.
- `PARTITION BY shop_id` — resets numbering at each shop (or week, region, category).
- `ORDER BY units_sold DESC, product_id ASC` — primary sort + secondary tie-break.
- Worked example: two shops, the window resets at the shop boundary.
| shop | product | rn |
|---|---|---|
| A | X | 1 |
| A | Y | 2 |
| A | Z | 3 |
| A | W | 4 |
| B | M | 1 |
| B | N | 2 |
| B | O | 3 |
Filtering top-K via CTE wrapper
Standard SQL evaluates `WHERE` before window functions, so you can't filter on `rn` directly in the same query. Wrap the windowed query in a CTE and filter outside it. The invariant: anything you compute with `OVER()` needs a CTE (or subquery) wrapper before you can filter on it.
- CTE pattern — portable across PostgreSQL, MySQL 8+, SQL Server, Oracle.
- `QUALIFY` clause — Snowflake / BigQuery / Teradata shorthand; not portable to PostgreSQL.
- Cost — O(N log N) for the partition-aware sort, O(N) for the filter; net O(N log N).
- Worked example: filtering with `WHERE rn <= 3` keeps X, Y, Z for shop A and M, N, O for shop B.
WITH ranked AS (
SELECT
shop_id, product_id, units_sold,
ROW_NUMBER() OVER (PARTITION BY shop_id ORDER BY units_sold DESC, product_id ASC) AS rn
FROM sales
)
SELECT shop_id, product_id, units_sold
FROM ranked
WHERE rn <= 3
ORDER BY shop_id, rn;
Common beginner mistakes
- Writing `WHERE rn <= 3` directly on a window column — evaluation order makes it invalid.
- Using `RANK` when the prompt wants exactly K rows and ties exist.
- Forgetting the tie-break secondary key — non-deterministic ordering for ties.
- Forgetting `PARTITION BY` and ranking globally instead of per-shop.
- Filtering with a correlated subquery instead of windows — O(N²) instead of O(N log N).
See more window-function problems →
SQL interview question on top-K ranking
Table sales(shop_id, product_id, units_sold). Return the top 3 products by units_sold for each shop, ties broken by product_id ascending. Sort output by shop_id, then rn.
Solution using ROW_NUMBER partitioned by shop
WITH ranked AS (
SELECT
shop_id,
product_id,
units_sold,
ROW_NUMBER() OVER (
PARTITION BY shop_id
ORDER BY units_sold DESC, product_id ASC
) AS rn
FROM sales
)
SELECT shop_id, product_id, units_sold
FROM ranked
WHERE rn <= 3
ORDER BY shop_id, rn;
Step-by-step trace (input: sales table below):
| shop_id | product_id | units_sold |
|---|---|---|
| A | X | 100 |
| A | Y | 80 |
| A | Z | 80 |
| A | W | 50 |
| B | M | 60 |
| B | N | 60 |
| B | O | 40 |
- Build the `ranked` CTE — compute `ROW_NUMBER()` per row using `PARTITION BY shop_id ORDER BY units_sold DESC, product_id ASC`.
- Partition reset at each shop — windowing starts fresh at shop A and again at shop B; rankings are independent per partition.
- Apply the order key in shop A — `units_sold DESC` puts X (100) first; Y and Z tie at 80, broken by `product_id ASC` (Y before Z); W (50) last → `rn = 1, 2, 3, 4`.
- Apply the order key in shop B — M and N tie at 60, broken by `product_id` (M before N); O (40) last → `rn = 1, 2, 3`.
- Outer filter `WHERE rn <= 3` — keeps shop A rows X (1), Y (2), Z (3); drops W (4). Keeps all three shop B rows (rn 1–3).
- `ORDER BY shop_id, rn` — sorts shop A's three rows first (in `rn` order), then shop B's three rows.
Output:
| shop_id | product_id | units_sold |
|---|---|---|
| A | X | 100 |
| A | Y | 80 |
| A | Z | 80 |
| B | M | 60 |
| B | N | 60 |
| B | O | 40 |
Why this works — concept by concept:
- `PARTITION BY shop_id` for per-group reset — the window restarts numbering at each shop, so the same `ROW_NUMBER()` expression produces independent rankings per group; no groupwise subqueries needed.
- `ORDER BY units_sold DESC, product_id ASC` — the primary key sorts top-first; the secondary `product_id ASC` gives a stable, deterministic tie-break — exactly what interviewers grade for.
- `ROW_NUMBER` over `RANK` — `ROW_NUMBER` always emits `1, 2, 3, ...` (no ties, no skips), keeping exactly K rows per shop; `RANK` could emit `1, 2, 2, 4` and break the "top-3 only" guarantee.
- CTE wrapper for window filtering — standard SQL evaluates `WHERE` before `OVER()`, so `WHERE rn <= 3` is invalid in the same query as the window expression; wrapping in `WITH ranked AS (...)` lets the outer `WHERE` reference `rn`.
- Outer `ORDER BY shop_id, rn` — produces clean, deterministic output ordering for downstream consumption (tests, dashboards).
- Cost — each partition of size P sorts in O(P log P), so the whole pass is O(N log N) worst case; the outer filter is O(N) — net O(N log N).
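The full solution can be checked against the sample data with Python's sqlite3 module (SQLite 3.25+ for window-function support); a sketch:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (shop_id TEXT, product_id TEXT, units_sold INTEGER);
INSERT INTO sales VALUES
  ('A','X',100), ('A','Y',80), ('A','Z',80), ('A','W',50),
  ('B','M',60),  ('B','N',60), ('B','O',40);
""")

# Identical CTE + outer-filter pattern from the solution above.
rows = con.execute("""
WITH ranked AS (
  SELECT shop_id, product_id, units_sold,
         ROW_NUMBER() OVER (
           PARTITION BY shop_id
           ORDER BY units_sold DESC, product_id ASC
         ) AS rn
  FROM sales
)
SELECT shop_id, product_id, units_sold
FROM ranked
WHERE rn <= 3
ORDER BY shop_id, rn;
""").fetchall()

print(rows)
# [('A', 'X', 100), ('A', 'Y', 80), ('A', 'Z', 80),
#  ('B', 'M', 60), ('B', 'N', 60), ('B', 'O', 40)]
```

Shop A's fourth product W is dropped and shop B keeps all three rows, matching the expected output table.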
7. SQL Date Functions for KPI Analysis
Date functions for KPI analysis in SQL for data engineering
KPI analysis is the standard end-of-quarter reporting workload: compute daily, weekly, monthly metrics across a date range. The mental model: bucket timestamps with DATE_TRUNC; group by the bucket; compute multiple metrics per group with conditional aggregation. One query returns all KPIs—no separate query per metric.
DATE_TRUNC for KPI bucketing (daily / weekly / monthly)
DATE_TRUNC(unit, ts) truncates a timestamp to the start of the named unit ('day', 'week', 'month', 'quarter'). The invariant: DATE_TRUNC produces a left-edge timestamp for the bucket.
- Same pattern across grains — only the unit string changes; aggregation logic stays the same.
- Week start — PostgreSQL returns Monday (ISO 8601). Shift to a Sunday start with `DATE_TRUNC('week', ts + INTERVAL '1 day') - INTERVAL '1 day'`.
- Worked example: three rides bucketed to day and hour.
| ride_id | ride_start_ts | DATE_TRUNC('day', ts) | DATE_TRUNC('hour', ts) |
|---|---|---|---|
| 1 | 2026-04-28 09:15 | 2026-04-28 00:00 | 2026-04-28 09:00 |
| 2 | 2026-04-28 11:30 | 2026-04-28 00:00 | 2026-04-28 11:00 |
| 3 | 2026-04-29 02:00 | 2026-04-29 00:00 | 2026-04-29 02:00 |
SELECT
DATE_TRUNC('day', ride_start_ts) AS day,
COUNT(*) AS rides
FROM rides
GROUP BY day
ORDER BY day;
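SQLite has no `DATE_TRUNC`, so a quick way to internalize the left-edge invariant is a plain-Python stand-in. The `date_trunc` helper below is illustrative, not a library function:

```python
from datetime import datetime

def date_trunc(unit: str, ts: datetime) -> datetime:
    """Hypothetical mini-DATE_TRUNC: zero out every field below `unit`,
    yielding the left edge of the bucket, same as the SQL version."""
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")

ts = datetime(2026, 4, 28, 9, 15)
print(date_trunc("day", ts))   # 2026-04-28 00:00:00
print(date_trunc("hour", ts))  # 2026-04-28 09:00:00
```

The two results match the `DATE_TRUNC('day', ts)` and `DATE_TRUNC('hour', ts)` columns for ride 1 in the table above.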
EXTRACT and EPOCH for duration metrics
Duration is a difference between two timestamps; subtraction returns an interval. EXTRACT(EPOCH FROM interval) converts the interval to total seconds as a numeric. The invariant: EXTRACT(EPOCH FROM interval) converts duration to seconds—divide by 60 / 3600 / 86400 for minutes / hours / days.
- `(end_ts - start_ts)` — an interval; not directly numeric.
- `EXTRACT(EPOCH FROM ...)` — total seconds as `numeric`.
- `AVG(EXTRACT(EPOCH FROM (end - start)))` — aggregates duration across rides.
- Worked example: average duration per day.
| day | avg_duration_minutes |
|---|---|
| 2026-04-28 | 12.5 |
| 2026-04-29 | 8.0 |
SELECT
DATE_TRUNC('day', ride_start_ts) AS day,
AVG(EXTRACT(EPOCH FROM (ride_end_ts - ride_start_ts)) / 60) AS avg_duration_min
FROM rides
GROUP BY day
ORDER BY day;
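The same interval-to-seconds conversion has a direct Python analogue: timestamp subtraction yields a `timedelta`, and `total_seconds()` plays the role of `EXTRACT(EPOCH FROM ...)`. A one-screen sketch of a single 12.5-minute ride:

```python
from datetime import datetime

start = datetime(2026, 4, 28, 9, 15)
end = datetime(2026, 4, 28, 9, 27, 30)

# timedelta.total_seconds() == EXTRACT(EPOCH FROM (end - start))
seconds = (end - start).total_seconds()
minutes = seconds / 60  # divide by 60 / 3600 / 86400 for min / hr / days

print(seconds, minutes)  # 750.0 12.5
```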
Composing multiple KPIs in one query with conditional aggregates
Real KPI dashboards need many metrics in one shot: total rides, average duration, peak-hour share, weekend split. Combine DATE_TRUNC bucketing with conditional aggregation in a single SELECT. The invariant: composing aggregates in a single SELECT reads as a metric definition—adding a new KPI is one new aggregate expression.
- `COUNT(*)` — total rides.
- `AVG(EXTRACT(EPOCH FROM ...))` — average duration.
- `SUM(CASE WHEN EXTRACT(HOUR FROM ts) BETWEEN 7 AND 9 THEN 1 ELSE 0 END)` — morning-peak count.
- `SUM(CASE WHEN EXTRACT(DOW FROM ts) IN (0, 6) THEN 1 ELSE 0 END)` — weekend count (Sunday=0, Saturday=6).
- Worked example: two days, multi-metric output.
| day | total_rides | avg_duration_min | morning_peak | weekend_rides |
|---|---|---|---|---|
| 2026-04-28 (Tue) | 100 | 12.5 | 25 | 0 |
| 2026-05-02 (Sat) | 150 | 8.0 | 30 | 150 |
SELECT
DATE_TRUNC('day', ride_start_ts) AS day,
COUNT(*) AS total_rides,
AVG(EXTRACT(EPOCH FROM (ride_end_ts - ride_start_ts)) / 60) AS avg_duration_min,
SUM(CASE WHEN EXTRACT(HOUR FROM ride_start_ts) BETWEEN 7 AND 9 THEN 1 ELSE 0 END) AS morning_peak_rides,
SUM(CASE WHEN EXTRACT(DOW FROM ride_start_ts) IN (0, 6) THEN 1 ELSE 0 END) AS weekend_rides
FROM rides
GROUP BY day
ORDER BY day;
Common beginner mistakes
- Using `BETWEEN '2026-04-01' AND '2026-04-30'` for "all of April" on a timestamp column — the upper bound casts to midnight on April 30, silently dropping the rest of that day; use `>= '2026-04-01' AND < '2026-05-01'` instead.
- Computing each KPI in a separate query — slow, redundant, hard to maintain.
- Forgetting `GROUP BY` after aggregates — syntax error.
- `EXTRACT(DOW FROM ts)` is 0-indexed (Sunday = 0) — off-by-one in weekend filters.
- Using `INTERVAL '7 day'` instead of `INTERVAL '7 days'` — both work in PostgreSQL, but the plural form is canonical.
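The April-boundary trap is easy to reproduce with sqlite3, where lexicographic comparison of ISO-8601 text timestamps mirrors Postgres casting a bare date literal to midnight:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rides (ts TEXT)")
con.executemany("INSERT INTO rides VALUES (?)",
                [("2026-04-15 09:00:00",), ("2026-04-30 10:00:00",)])

# BETWEEN's upper bound is the bare date '2026-04-30', i.e. midnight --
# the 10:00 ride on April 30 compares greater and is silently dropped.
between = con.execute(
    "SELECT COUNT(*) FROM rides WHERE ts BETWEEN '2026-04-01' AND '2026-04-30'"
).fetchone()[0]

# Half-open bounds catch all of April, including the entire last day.
half_open = con.execute(
    "SELECT COUNT(*) FROM rides WHERE ts >= '2026-04-01' AND ts < '2026-05-01'"
).fetchone()[0]

print(between, half_open)  # 1 2
```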
SQL interview question on KPI analysis
Table rides(ride_id, ride_start_ts, ride_end_ts, fare). Return per-day KPIs: total rides, average ride duration in minutes, total fare, percentage of rides in morning peak (7–9 AM), and weekend ride count. Sort by day.
Solution using DATE_TRUNC + EXTRACT + conditional aggregation
SELECT
DATE_TRUNC('day', ride_start_ts) AS day,
COUNT(*) AS total_rides,
ROUND(
AVG(EXTRACT(EPOCH FROM (ride_end_ts - ride_start_ts)) / 60),
2
) AS avg_duration_min,
SUM(fare) AS total_fare,
ROUND(
100.0 * SUM(CASE WHEN EXTRACT(HOUR FROM ride_start_ts) BETWEEN 7 AND 9 THEN 1 ELSE 0 END)
/ NULLIF(COUNT(*), 0),
2
) AS morning_peak_pct,
SUM(CASE WHEN EXTRACT(DOW FROM ride_start_ts) IN (0, 6) THEN 1 ELSE 0 END) AS weekend_rides
FROM rides
GROUP BY day
ORDER BY day;
Step-by-step trace (input: rides table below — Tue 2026-04-28 and Sat 2026-05-02):
| ride_id | ride_start_ts | ride_end_ts | fare |
|---|---|---|---|
| 1 | 2026-04-28 08:00 | 2026-04-28 08:15 | 12.00 |
| 2 | 2026-04-28 12:00 | 2026-04-28 12:10 | 10.00 |
| 3 | 2026-05-02 09:00 | 2026-05-02 09:08 | 8.00 |
| 4 | 2026-05-02 14:00 | 2026-05-02 14:08 | 8.00 |
- Bucket by day — `DATE_TRUNC('day', ride_start_ts)` maps rides 1, 2 → `2026-04-28`; rides 3, 4 → `2026-05-02`.
- `GROUP BY day` — collapses to one row per day; the Tue group has 2 rides, the Sat group has 2 rides.
- `COUNT(*)` — Tue: 2; Sat: 2.
- Average duration — Tue: avg of 15 and 10 min = 12.5 min; Sat: avg of 8 and 8 min = 8.0 min. `EXTRACT(EPOCH FROM (end - start)) / 60` converts each interval to minutes.
- `SUM(fare)` — Tue: `12 + 10 = 22.00`; Sat: `8 + 8 = 16.00`.
- Morning-peak count — `EXTRACT(HOUR FROM ride_start_ts) BETWEEN 7 AND 9`. Tue ride 1 (08:00) qualifies, ride 2 (12:00) doesn't → count = 1. Sat ride 3 (09:00) qualifies, ride 4 (14:00) doesn't → count = 1.
- Morning-peak percentage — Tue: `100.0 * 1 / NULLIF(2, 0) = 50.00`. Sat: `100.0 * 1 / NULLIF(2, 0) = 50.00`. Rounded to 2 decimals.
- Weekend count — `EXTRACT(DOW FROM ride_start_ts) IN (0, 6)` (Sun=0, Sat=6). Tue (DOW=2): `0 + 0 = 0`. Sat (DOW=6): `1 + 1 = 2`.
- `ORDER BY day` — sorts ascending: Tue first, Sat second.
Output:
| day | total_rides | avg_duration_min | total_fare | morning_peak_pct | weekend_rides |
|---|---|---|---|---|---|
| 2026-04-28 | 2 | 12.50 | 22.00 | 50.00 | 0 |
| 2026-05-02 | 2 | 8.00 | 16.00 | 50.00 | 2 |
Why this works — concept by concept:
- `DATE_TRUNC('day', ts)` for bucketing — produces a left-edge timestamp per ride; `GROUP BY` on this expression collapses all rides on the same calendar day into one group.
- Composing aggregates in one `SELECT` — every KPI is a single aggregate expression evaluated in the same pass; adding a new metric is one extra column, not a new query.
- `EXTRACT(EPOCH FROM interval)` for duration — converts the `interval` returned by `(end - start)` to total seconds as a numeric; dividing by 60 yields minutes, the readable display unit.
- Conditional aggregation for morning peak — `SUM(CASE WHEN EXTRACT(HOUR ...) BETWEEN 7 AND 9 THEN 1 ELSE 0 END)` counts only the qualifying rides in the same group pass — no separate query needed.
- `100.0 *` float coercion — forces float division so `1 / 2` produces `0.5`, not integer `0`; `NULLIF(COUNT(*), 0)` guards empty buckets against divide-by-zero.
- `EXTRACT(DOW FROM ts) IN (0, 6)` for weekends — Sunday is `0`, Saturday is `6` in PostgreSQL — the off-by-one trap interviewers ask about; one expression covers both weekend days.
- `ROUND(..., 2)` for clean display — strips float noise from the average and percentage outputs.
- Cost — one scan of `rides`, a hash aggregate keyed by truncated day, then a sort of the distinct days; O(N + K log K) where N is rides and K is distinct days.
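The whole trace can be replayed in plain Python as a sanity check. A sketch over the same four rides (note one portability trap called out in a comment: Python's `weekday()` numbers days Mon=0..Sun=6, unlike Postgres `DOW`):

```python
from collections import defaultdict
from datetime import datetime

# (ride_id, start, end, fare) -- same four rides as the trace above.
rides = [
    (1, datetime(2026, 4, 28, 8, 0), datetime(2026, 4, 28, 8, 15), 12.00),
    (2, datetime(2026, 4, 28, 12, 0), datetime(2026, 4, 28, 12, 10), 10.00),
    (3, datetime(2026, 5, 2, 9, 0), datetime(2026, 5, 2, 9, 8), 8.00),
    (4, datetime(2026, 5, 2, 14, 0), datetime(2026, 5, 2, 14, 8), 8.00),
]

by_day = defaultdict(list)            # DATE_TRUNC('day', ...) + GROUP BY
for ride in rides:
    by_day[ride[1].date()].append(ride)

kpis = {}
for day, group in sorted(by_day.items()):    # ORDER BY day
    durations = [(e - s).total_seconds() / 60 for _, s, e, _ in group]
    peak = sum(1 for _, s, _, _ in group if 7 <= s.hour <= 9)
    # Caution: Python weekday() is Mon=0..Sun=6, not Postgres DOW (Sun=0).
    weekend = sum(1 for _, s, _, _ in group if s.weekday() >= 5)
    kpis[day.isoformat()] = (
        len(group),                                   # total_rides
        round(sum(durations) / len(durations), 2),    # avg_duration_min
        round(sum(f for *_, f in group), 2),          # total_fare
        round(100.0 * peak / len(group), 2),          # morning_peak_pct
        weekend,                                      # weekend_rides
    )

print(kpis)
# {'2026-04-28': (2, 12.5, 22.0, 50.0, 0), '2026-05-02': (2, 8.0, 16.0, 50.0, 2)}
```

Both rows match the expected output table: 50% morning-peak share on each day, and weekend rides only on Saturday.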
Tips to crack Oracle data engineering interviews
These are habits that move the needle in real Oracle loops—not a re-statement of the topics above.
Python preparation
Drill the three patterns in Oracle's Python set: stack for valid brackets, hash table for two-sum, dict / JSON for extraction. Stdlib fluency (json, dict.get, list as stack, dict as hash map) matters more than algorithm depth. Drill problems from the stack, hash table, and dictionary topic pages.
SQL preparation
Drill the four patterns: aggregation with joins for percentages, self-joins with date filters for roundtrips, window-function ranking for top-K, and date functions for KPIs. Memorize the 100.0 * filtered / total percentage formula, the origin↔dest swap for self-joins, the ROW_NUMBER OVER (PARTITION BY ... ORDER BY ... DESC) template, and DATE_TRUNC + EXTRACT for KPIs. The aggregation, self-join, ranking, and date functions topic pages cover the bulk.
Domain-agnostic framing
Oracle's prompts span four verticals (food, flights, retail, mobility). The interviewer is grading whether you map each framing to the right algorithm regardless of domain. State the mapping out loud: "this is conditional aggregation"; "this is a self-join with a date filter"; "this is top-K via window ranking"; "this is multi-KPI date-bucket aggregation."
Where to practice on PipeCode
| Skill lane | Practice path |
|---|---|
| Curated Oracle practice set | /explore/practice/company/oracle |
| Stack | /explore/practice/topic/stack |
| Hash table | /explore/practice/topic/hash-table |
| Dictionary | /explore/practice/topic/dictionary |
| Aggregation | /explore/practice/topic/aggregation |
| Self-join | /explore/practice/topic/self-join |
| Ranking | /explore/practice/topic/ranking |
| Date functions | /explore/practice/topic/date-functions |
| All practice topics | /explore/practice/topics |
| Interview courses | /explore/courses |
Communication under time pressure
State assumptions before typing: "I'll assume flights has unique flight_id"; "I'll assume timestamps are in a single timezone"; "I'll assume rides.fare is non-null." State invariants after key code blocks. State complexity: "this is O(N log N) for the partitioned sort + O(N) for the filter." Interviewers grade clear reasoning above silent-and-perfect.
Frequently Asked Questions
What is the Oracle data engineering interview process like?
The Oracle data engineering interview typically includes a phone screen (mostly Python warm-up around stack / hash table / dict patterns), one or two coding rounds focused on SQL aggregation, joins, and window functions, plus a system-design conversation around pipelines and data architectures, and behavioral interviews. The curated 7-problem Oracle practice set on PipeCode mirrors what you will see on the technical rounds.
What SQL topics does Oracle test for data engineers?
Oracle emphasizes conditional aggregation with joins (for percentages), self-joins with date filters (for relationship detection like roundtrips), window-function ranking (for top-K-per-group), and date functions (DATE_TRUNC + EXTRACT) for KPI analysis. Drill these on the aggregation, self-join, ranking, and date functions topic pages.
How important is Python for an Oracle data engineering interview?
Python is roughly half the technical interview at Oracle. The patterns are classics: stack for bracket validation, hash table for two-sum, dict / JSON for extraction. Stdlib fluency (list, dict, json) matters more than algorithm depth.
Are Oracle's data engineering questions about Oracle Database specifically?
No—despite the company name, the curated practice problems are dialect-agnostic SQL (typically PostgreSQL syntax) and standard Python. There's no Oracle-Database-specific PL/SQL or proprietary feature in the curated set. Prepare for PostgreSQL-flavor SQL; the patterns translate to Oracle Database SQL with minor syntax adjustments (e.g. NVL instead of COALESCE).
How hard are Oracle data engineering interview questions?
Oracle's curated set has 4 easy + 3 medium + 0 hard = the most fundamentals-friendly hub of the seven covered so far. No exotic algorithms (no Morris traversal, no DP, no advanced graph algos)—just classic interview patterns. Good entry hub for early-career candidates and quick refresher for senior candidates.
How many Oracle practice problems should I solve before the interview?
Aim for 20–30 problems spanning all seven topic clusters above—not 100 of the same kind. Solve every problem in the Oracle-tagged practice set, then back-fill weak areas using the topic pages linked throughout this guide.
Start practicing Oracle data engineering problems
Reading patterns is not the same as typing them under time pressure. PipeCode pairs company-tagged Oracle problems with tests, AI feedback, and a coding environment so you can drill the exact SQL and Python patterns Oracle asks—without the noise of generic algorithm prep that doesn't apply to this loop.
Pipecode.ai is Leetcode for Data Engineering.