Oracle data engineering interview questions are a balanced fundamentals + classics tour: three Python primitives (stack-based bracket validation, hash-table two-sum, dict / JSON extraction) and four SQL primitives (aggregation with joins for percentage calculations, self-joins with date functions for roundtrip detection, window-function ranking for top-K per group, and date functions for KPI analysis). The framings span restaurants, flights, coffee shops, and ride-sharing—domain-agnostic interview classics, not Oracle-Database trivia.
This guide walks through the seven topic clusters Oracle actually tests, each with a detailed topic explanation, per-sub-topic explanations with worked examples and solutions, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 7-problem Oracle set (4 easy, 3 medium, 0 hard)—the most fundamentals-friendly company hub covered so far, with no Hard tier and a balanced SQL/Python mix.
Top Oracle data engineering interview topics
From the Oracle data engineering practice set, the seven numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–7) | Why it shows up at Oracle |
|---|---|---|
| 1 | Python stacks for valid brackets | Valid Brackets—push openers, pop on closers, validate the match. |
| 2 | Python hash tables for two-sum | Two Sum—single-pass complement lookup, O(n). |
| 3 | Python dicts for JSON extraction | JSON User Name Extraction—json.loads plus chained dict access. |
| 4 | SQL aggregation with joins for percentage calculations | Percentage of Vegetarian-Only Restaurants—COUNT filtered with CASE WHEN, divide by total. |
| 5 | SQL self-joins with date functions for roundtrip detection | Roundtrip Flights Query—self-join with origin↔dest swap and date filter. |
| 6 | SQL window-function ranking for top-K per group | Coffee Shop Top 3 Products—ROW_NUMBER() partitioned by shop, filter to top 3. |
| 7 | SQL date functions for KPI analysis | Ride-Sharing KPI Analysis—DATE_TRUNC for time bucketing, EXTRACT for duration. |
Domain-agnostic framing rule: Oracle's prompts span four different verticals (food, flights, retail, mobility). The interviewer is grading whether you map each framing to the right algorithm: brackets → stack; pair lookup → hash; JSON → dict navigation; percentage → conditional aggregation; roundtrip → self-join + date filter; top-K → window ranking; KPI → date functions. State the mapping out loud.
1. Python Stacks for Valid Brackets
Stacks for matching brackets in Python for data engineering
Walk a string of brackets; push openers, pop on closers, verify match; return True if balanced. The mental model: a stack tracks "what we expect to close next." Each opener pushes its expectation; each closer pops the most-recent expectation and verifies it matches. A naive count-of-each-bracket fails on crossed inputs like ([)]—the stack pattern catches them.
Pro tip: Use a Python list (`append`/`pop` at the right end) as the stack—both O(1) amortized. `list.pop(0)` is O(n) and silently wrong for LIFO semantics.
Stack basics: list.append / list.pop
A Python list is the canonical LIFO stack. The invariant: the stack at any moment contains the un-closed openers in order; the top is the most recent unclosed opener.
- `stack.append(x)` — push onto the right end, O(1) amortized.
- `stack.pop()` — remove and return the rightmost, O(1).
- `stack[-1]` — peek without popping.
- `if not stack` — empty check before popping (otherwise `IndexError`).
- Worked example: push `(`, `[`, `{` → stack is `['(', '[', '{']`; one `pop()` returns `'{'` and leaves `['(', '[']`.
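The operations above can be run directly in a REPL; here is a minimal sketch of the push/peek/pop cycle from the worked example (a demo, not part of the interview solution):

```python
# A list as a LIFO stack: the right end is the top.
stack: list[str] = []

for ch in "([{":
    stack.append(ch)      # push, O(1) amortized
print(stack)              # ['(', '[', '{']

print(stack[-1])          # peek without popping: '{'
print(stack.pop())        # pop returns '{' and removes it
print(stack)              # ['(', '[']

if not stack:             # always guard before popping
    print("stack is empty")
```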
Push openers, pop on closers, validate match
The full algorithm: define a pairs map from closer to its matching opener, walk the string, push openers, and on each closer pop+verify. The invariant: at any point, the stack reflects the chain of unclosed openers.
- Push when the char is one of `({[`.
- Pop + match when the char is a closer; return `False` if the stack is empty or the popped opener doesn't match.
- End check: non-empty stack means unmatched openers → return `False`.
- Total time O(N), memory O(N) worst case for deeply nested input.
Worked example: `s = "({[]})"`.

| char | action | stack after |
|---|---|---|
| `(` | push | `['(']` |
| `{` | push | `['(', '{']` |
| `[` | push | `['(', '{', '[']` |
| `]` | pop, check `[ == [` ✓ | `['(', '{']` |
| `}` | pop, check `{ == {` ✓ | `['(']` |
| `)` | pop, check `( == (` ✓ | `[]` |
End: stack empty → balanced.
def is_valid(s: str) -> bool:
pairs = {")": "(", "]": "[", "}": "{"}
stack: list[str] = []
for ch in s:
if ch in "({[":
stack.append(ch)
elif ch in ")}]":
if not stack or stack.pop() != pairs[ch]:
return False
return not stack
Pairs dict for multiple bracket types
The pairs dict is the key to generalizing—add new bracket types by adding entries (e.g. "⟩": "⟨"); the algorithm code doesn't change. The invariant: same stack pattern + a configurable pairs dict handles any bracket-like grammar.
- `openers = set(pairs.values())` — separates push-when from pop-when checks.
- `elif ch in pairs:` — dispatches closers via dict membership; non-bracket characters fall through.
- Worked example: the same algorithm with an extended dict handles `<>`, `⟨⟩`, math notation; a one-line dict change adds new bracket types.
def is_valid(s: str) -> bool:
pairs = {")": "(", "]": "[", "}": "{"}
openers = set(pairs.values())
stack: list[str] = []
for ch in s:
if ch in openers:
stack.append(ch)
elif ch in pairs:
if not stack or stack.pop() != pairs[ch]:
return False
return not stack
Common beginner mistakes
- Forgetting to check for an empty stack before popping — `IndexError` on inputs like `)`.
- Using `list.pop(0)` (FIFO) instead of `list.pop()` (LIFO) — wrong stack semantics, also O(n).
- Hardcoding per-bracket branches like `if ch == "(" and stack[-1] == ")"` — doesn't generalize.
- Returning `True` mid-loop on a successful match — you must check end-of-string with an empty stack.
- Counting opens vs closes without a stack — misses crossed brackets like `([)]`.
Python interview question on valid brackets
Implement is_valid(s) that returns True if the input string s has properly balanced and nested (), [], {}. Non-bracket characters are not in the input. Use O(n) time.
Solution using a stack and a pairs dict
def is_valid(s: str) -> bool:
pairs = {")": "(", "]": "[", "}": "{"}
openers = set(pairs.values())
stack: list[str] = []
for ch in s:
if ch in openers:
stack.append(ch)
elif ch in pairs:
if not stack or stack.pop() != pairs[ch]:
return False
return not stack
Step-by-step trace (input: s = "([{}])"):
| step | char | action | stack after |
|---|---|---|---|
| 0 | — | initial | `[]` |
| 1 | `(` | opener — push | `['(']` |
| 2 | `[` | opener — push | `['(', '[']` |
| 3 | `{` | opener — push | `['(', '[', '{']` |
| 4 | `}` | closer — pop `{` matches `pairs['}'] = '{'` | `['(', '[']` |
| 5 | `]` | closer — pop `[` matches `pairs[']'] = '['` | `['(']` |
| 6 | `)` | closer — pop `(` matches `pairs[')'] = '('` | `[]` |
- Initialize — `pairs` maps each closer to its opener; `openers = {'(', '[', '{'}`; `stack = []`.
- Iterate — walk `s` once, classifying each char as opener, closer, or other.
- Push opener — at indexes 0–2, push `(`, `[`, `{` onto the stack.
- Pop and check on closer — at index 3 (`}`), pop `{`; compare to `pairs['}'] = '{'` — match, continue.
- Continue popping — index 4 pops `[` matching `pairs[']']`; index 5 pops `(` matching `pairs[')']`.
- Final check — loop exits, `stack` is empty; `return not stack` evaluates `not [] = True`.
- Mismatch case (e.g. `([)]`) — at index 2 (`)`), pop returns `[`; `pairs[')'] = '('`, mismatch → return `False` immediately.
Output:

| input | result |
|---|---|
| `"([{}])"` | `True` |
| `"([)]"` | `False` |
| `"(("` | `False` |
| `")"` | `False` |
Why this works — concept by concept:
- Stack as LIFO history — pushing every opener captures the nesting order; the most recently opened bracket is the only one a closer can legally close.
- Pairs dict as O(1) lookup — `pairs[closer]` returns the matching opener in constant time, no branching on bracket type.
- Empty-stack guard — `not stack` short-circuits before `stack.pop()`, catching "closer without opener" (e.g. `)`) without an `IndexError`.
- Mismatch guard — `stack.pop() != pairs[ch]` catches crossed brackets (e.g. `([)]`) where a closer doesn't match the most-recent opener.
- Final emptiness check — `return not stack` catches unclosed openers (e.g. `((`); a non-empty stack at end means dangling openers.
- Cost — one pass, O(n) time; stack worst case O(n) memory for fully nested input like `(((...)))`.
2. Python Hash Tables for Two-Sum
Hash tables for pair-lookup in Python for data engineering
Given an array of integers and a target sum, return the indices of the two values that add to the target. The naive answer is two nested loops (O(n²)); the clean answer is a single-pass hash-table complement lookup (O(n) time, O(n) space). The mental model: for each element x, the answer is at the index of target - x; build seen = {value: index} as you walk and check the complement before storing.
Two-sum brute force vs hash-table O(n)
Brute force checks every pair (i, j) with i < j for arr[i] + arr[j] == target—O(n²), intractable for n = 10^6. The hash rewrite walks once, checks target - x against a running seen dict, and trades O(n) space for the speedup.
- Brute force: nested loops, O(n²) time, O(1) space.
- Hash single-pass: O(n) time, O(n) space; check before store so `seen` only contains earlier indices.
- Worked example: `arr = [2, 7, 11, 15]`, `target = 9`.

| i | x | complement | seen before | found? | seen after |
|---|---|---|---|---|---|
| 0 | 2 | 7 | `{}` | no | `{2: 0}` |
| 1 | 7 | 2 | `{2: 0}` | yes → return `[0, 1]` | — |
def two_sum(nums: list[int], target: int) -> list[int]:
seen: dict[int, int] = {}
for i, x in enumerate(nums):
complement = target - x
if complement in seen:
return [seen[complement], i]
seen[x] = i
return []
Single-pass complement lookup
The single-pass form is structural: check before store. Reversing the order would let x match itself (the complement could equal x, and seen[x] would point at the same index). The invariant: seen always contains earlier indices; checking complement before storing guarantees we don't match an index against itself.
- Check first — `if complement in seen: return [...]`.
- Then store — `seen[x] = i` after the check.
- Worked example: `arr = [3, 3]`, `target = 6` → at `i=0` no match, store `{3: 0}`; at `i=1` complement `3` is in `seen` → return `[0, 1]`. Storing first would let `3` match its own index and return `[0, 0]`.
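The duplicate case is easy to verify by contrasting the two orderings side by side; the store-first variant below is deliberately buggy, written only to show the failure mode:

```python
def two_sum(nums: list[int], target: int) -> list[int]:
    """Correct version: check the complement, then store."""
    seen: dict[int, int] = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []


def two_sum_store_first(nums: list[int], target: int) -> list[int]:
    """Deliberately buggy: storing before checking lets x match itself."""
    seen: dict[int, int] = {}
    for i, x in enumerate(nums):
        seen[x] = i                    # store first -- the bug
        if target - x in seen:
            return [seen[target - x], i]
    return []


print(two_sum([3, 3], 6))              # [0, 1] -- the earlier 3 pairs with the later 3
print(two_sum_store_first([3, 3], 6))  # [0, 0] -- index 0 "pairs" with itself
```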
Two-pointer alternative on sorted input
When the input is already sorted, place left = 0, right = n - 1 and step inward by comparing arr[left] + arr[right] to target. O(n) time, O(1) space—beats hash on memory but loses if you have to sort first (O(n log n)).
- Unsorted input — hash-table single-pass (O(n) time, O(n) space).
- Sorted input — two-pointer (O(n) time, O(1) space).
- Index caveat: two-pointer returns indices into the sorted array, which may not match original positions if you sorted in place.
- Worked example: `arr = [2, 7, 11, 15]`, `target = 9`.

| step | left | right | sum | action |
|---|---|---|---|---|
| 1 | 0 | 3 | 17 | sum > target, decrement right |
| 2 | 0 | 2 | 13 | sum > target, decrement right |
| 3 | 0 | 1 | 9 | sum == target → return `[0, 1]` |
def two_sum_sorted(nums: list[int], target: int) -> list[int]:
left, right = 0, len(nums) - 1
while left < right:
s = nums[left] + nums[right]
if s == target:
return [left, right]
elif s < target:
left += 1
else:
right -= 1
return []
Common beginner mistakes
- Reaching for nested loops (O(n²)) when the hash-table single pass is O(n).
- Using `list.index()` to find the complement — O(n) per lookup, ruining the hash speedup.
- Storing before checking — breaks for duplicate-value cases.
- Two-pointer on unsorted input — silently wrong.
- Returning the values instead of the indices when the prompt asks for indices.
Python interview question on two-sum
Given an array of integers nums and a target integer target, return the indices of the two numbers that add to target. Each input has exactly one solution; you may not use the same element twice. Use O(n) time.
Solution using a hash-table single-pass complement lookup
def two_sum(nums: list[int], target: int) -> list[int]:
seen: dict[int, int] = {}
for i, x in enumerate(nums):
complement = target - x
if complement in seen:
return [seen[complement], i]
seen[x] = i
return []
Step-by-step trace (input: nums = [2, 7, 11, 15], target = 9):
| i | x | complement | seen before | check | action | seen after |
|---|---|---|---|---|---|---|
| 0 | 2 | 7 | `{}` | miss | store `seen[2] = 0` | `{2: 0}` |
| 1 | 7 | 2 | `{2: 0}` | hit | return `[seen[2], 1] = [0, 1]` | — |
- Initialize — `seen = {}` (empty dict; will hold `value: index` pairs).
- `i = 0, x = 2` — compute `complement = 9 - 2 = 7`; `7 not in seen`; store `seen[2] = 0`.
- `i = 1, x = 7` — compute `complement = 9 - 7 = 2`; `2 in seen` → hit.
- Return — `[seen[2], 1]` = `[0, 1]`. Function exits before storing `seen[7]`.
- Self-pair safety — the check happens before the store, so when `x = 7` is processed, `seen` does not yet contain index 1 → it can't match itself.
- No-solution path — if the loop exits without a hit (no two values sum to target), the trailing `return []` fires.
Output:

| input | result |
|---|---|
| `nums = [2, 7, 11, 15], target = 9` | `[0, 1]` |
| `nums = [3, 2, 4], target = 6` | `[1, 2]` |
| `nums = [1, 2, 3], target = 100` | `[]` |
Why this works — concept by concept:
- Complement reframing — instead of "find any pair `(i, j)` with `nums[i] + nums[j] == target`," reframe as "for each `x`, is `target - x` already seen?" — turns the O(n²) pair search into an O(n) lookup.
- Hash map for O(1) membership — `seen` is a dict keyed by value; Python's average-case O(1) `in` and `[]` give constant-time complement lookup.
- Check-before-store ordering — `if complement in seen` runs before `seen[x] = i`, so an element can't match itself in the same iteration.
- Index ordering — when a hit fires, `seen[complement]` is the earlier index and `i` is the current index; returning `[seen[complement], i]` produces the smaller-first ordering most prompts request.
- Single-pass guarantee — exactly one walk of `nums`; no nested loop, no second scan.
- Cost — O(n) time (one pass × average O(1) hash ops); O(n) memory worst case for the dict.
3. Python Dicts for JSON Extraction
JSON parsing and dict navigation in Python for data engineering
Real DE pipelines consume JSON every day: API responses, event-stream payloads, config files, NoSQL documents. The mental model: json.loads converts a JSON string into a nested Python dict/list; navigation is plain Python access. The trap: data["user"]["profile"]["name"] raises KeyError at the first missing level—use .get() chaining or try/except for robustness.
json.loads for string-to-dict parsing
json.loads(s) parses a JSON string into Python objects; json.load(f) reads from a file. The reverse pair is json.dumps / json.dump. Errors raise json.JSONDecodeError (a subclass of ValueError).
- JSON object `{...}` → Python `dict`.
- JSON array `[...]` → Python `list`.
- JSON `null` / `true` / `false` → Python `None` / `True` / `False`.
- Double quotes only — `json.loads("{'name': 'Alice'}")` raises; use `ast.literal_eval` for Python-style strings.
- Worked example: parse `{"user": {"name": "Alice", "tags": ["admin"]}}`; access `data["user"]["name"]` → `"Alice"`, `data["user"]["tags"][0]` → `"admin"`.
import json
raw = '{"user": {"name": "Alice", "age": 30, "tags": ["admin", "active"]}}'
data = json.loads(raw)
print(data["user"]["name"]) # Alice
print(data["user"]["tags"][0]) # admin
Nested-key access with .get chaining and defaults
dict["key"] raises KeyError on missing keys; dict.get("key") returns None. For nested paths, chain .get(key, {}) so each intermediate miss returns an empty dict that the next .get can safely call. The invariant: .get chaining never raises on missing keys.
- `data.get("user", {}).get("profile", {}).get("name")` — safe nested access; returns `None` on any missing level.
- Helper for the project-wide pattern: a `deep_get(d, *keys, default=None)` walker.
- `try/except KeyError` — cleaner when missing-key cases are rare and need different defaults.
- Worked example: navigate `data["user"]["profile"]["name"]` when `profile` may be missing.

| structure | `data["user"]["profile"]["name"]` | `.get` chain |
|---|---|---|
| `{"user": {"profile": {"name": "Alice"}}}` | `"Alice"` | `"Alice"` |
| `{"user": {}}` | `KeyError` | `None` |
| `{}` | `KeyError` | `None` |
def get_name(data: dict) -> str | None:
return data.get("user", {}).get("profile", {}).get("name")
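The `deep_get` walker mentioned above might look like the sketch below; the name and signature follow the convention stated in the text, not any stdlib function:

```python
def deep_get(d: dict, *keys, default=None):
    """Walk nested dicts key by key; return default on any missing level."""
    current = d
    for key in keys:
        # Stop if the current level isn't a dict or lacks the key.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


data = {"user": {"profile": {"name": "Alice"}}}
print(deep_get(data, "user", "profile", "name"))          # Alice
print(deep_get({"user": {}}, "user", "profile", "name"))  # None
print(deep_get({}, "user", default="missing"))            # missing
```

Unlike the `.get(key, {})` chain, this version also survives intermediate values that are not dicts (e.g. `{"user": None}`), since it type-checks each level before descending.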
List-comprehension extraction over arrays of records
When the JSON contains an array of records, extract a field per record via a comprehension with safe defaults. Use .get("array", []) so a missing top-level key yields an empty iteration.
- `[u.get("name", "<unknown>") for u in data.get("users", [])]` — the canonical safe extract.
- Readability over terseness — name a helper for nested or filtered comprehensions; misread one-liners corrupt downstream data.
- Worked example: input `{"users": [{"name": "Alice", "id": 1}, {"name": "Bob", "id": 2}, {"id": 3}]}` → `["Alice", "Bob", "<unknown>"]`.
import json
def extract_names(raw: str) -> list[str]:
data = json.loads(raw)
return [u.get("name", "<unknown>") for u in data.get("users", [])]
Common beginner mistakes
- Using `data["key"]` for paths that may not exist — `KeyError` crashes the function.
- Forgetting that JSON requires double quotes — single-quoted "JSON" strings raise.
- Calling `.get` then `[i]` — `.get` returns `None`, and `None[0]` raises `TypeError`.
- Confusing `json.loads` (string → dict) and `json.load` (file → dict).
- Mutating the parsed dict and writing back without `json.dumps` — the original raw string is unchanged.
Python interview question on JSON extraction
Given a JSON string representing a list of user records under the key "users", each with optional fields "name", "email", "id", return a list of (id, name) tuples for users that have both fields. Use O(n) time.
Solution using json.loads and a list comprehension
import json
def extract_users(raw: str) -> list[tuple[int, str]]:
data = json.loads(raw)
return [
(u["id"], u["name"])
for u in data.get("users", [])
if "id" in u and "name" in u
]
Step-by-step trace (input: raw = '{"users": [{"id": 1, "name": "Alice"}, {"id": 2}, {"name": "Bob", "id": 3}]}'):
| u | `"id" in u` | `"name" in u` | both? | tuple emitted |
|---|---|---|---|---|
| `{"id": 1, "name": "Alice"}` | yes | yes | yes | `(1, "Alice")` |
| `{"id": 2}` | yes | no | no | — |
| `{"name": "Bob", "id": 3}` | yes | yes | yes | `(3, "Bob")` |
- Parse JSON — `json.loads(raw)` deserializes the JSON string into a Python `dict`: `{"users": [...]}`.
- Safe outer access — `data.get("users", [])` returns the records list; if the `"users"` key were missing entirely, it returns `[]` and the comprehension iterates zero times instead of raising `KeyError`.
- Filter records — for each user dict `u`, the `if` clause `"id" in u and "name" in u` keeps only records that have both required fields.
- Emit tuple — for kept records, `(u["id"], u["name"])` builds the result tuple.
- Skip incomplete row — `{"id": 2}` lacks `"name"`, the filter is `False`, no tuple emitted.
- Final list — the comprehension collects two tuples: `[(1, "Alice"), (3, "Bob")]`.
Output:
| id | name |
|---|---|
| 1 | Alice |
| 3 | Bob |
Why this works — concept by concept:
- `json.loads` deserialization — converts a JSON string into native Python types (dict/list/str/int/bool/None); raises `JSONDecodeError` only on malformed JSON.
- `.get("users", [])` safe default — replaces `data["users"]` (which raises `KeyError` on a missing key) with an empty list; the comprehension iterates zero times and returns `[]` cleanly.
- Membership filter `"id" in u and "name" in u` — short-circuits on missing keys before the `(u["id"], u["name"])` access; protects against `KeyError` inside the value expression.
- Tuple in comprehension — `(u["id"], u["name"])` constructs an immutable record per user; tuples are hashable and ordered, ideal for return values.
- Single-pass list comprehension — one walk over the records list, allocating one tuple per kept record; no intermediate state.
- Cost — O(n) for parsing the JSON plus one pass over the records; O(n) memory for the result list.
4. SQL Aggregation with Joins for Percentage Calculations
Aggregation with joins for percentage calculations in SQL for data engineering
Percentage-of-total calculations are everywhere in DE work: what percentage of users converted, what percentage of restaurants are vegetarian. The mental model: percentage = 100.0 * filtered_count / total_count, computed as conditional aggregation in a single query—numerator via SUM(CASE WHEN ... THEN 1 ELSE 0 END), denominator via COUNT(*).
Pro tip: Multiply by `100.0` (a float literal) to force float division—`2 / 3 = 0` in PostgreSQL with integer operands. Wrap the denominator in `NULLIF(..., 0)` to avoid divide-by-zero.
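Both traps are easy to reproduce locally. The quick check below runs through Python's stdlib sqlite3; SQLite happens to share the integer-division behavior described here for PostgreSQL, so it works as a stand-in (a demo, not a dialect guarantee):

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# Integer operands -> integer division: the fraction is truncated to 0.
print(cur.execute("SELECT 2 / 3").fetchone()[0])             # 0

# One float literal anywhere in the expression forces float division.
print(cur.execute("SELECT 100.0 * 2 / 3").fetchone()[0])     # 66.666...

# NULLIF turns a zero denominator into NULL; division by NULL yields NULL.
print(cur.execute("SELECT 1 / NULLIF(0, 0)").fetchone()[0])  # None
```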
COUNT filtered with CASE WHEN for conditional aggregation
SQL aggregates skip NULL, so the canonical "count rows matching predicate" idiom is SUM(CASE WHEN predicate THEN 1 ELSE 0 END) or equivalently COUNT(CASE WHEN predicate THEN 1 END). The invariant: conditional aggregation lets you compute multiple per-group metrics in a single pass without separate queries.
- `SUM(CASE WHEN ... THEN 1 ELSE 0 END)` — portable across all dialects.
- `COUNT(*) FILTER (WHERE predicate)` — cleaner SQL-standard form (PostgreSQL).
- Worked example: three restaurants, `is_vegetarian = TRUE, FALSE, TRUE`.

| metric | value |
|---|---|
| `COUNT(*)` | 3 |
| `SUM(CASE WHEN is_vegetarian THEN 1 ELSE 0 END)` | 2 |
| `COUNT(*) FILTER (WHERE is_vegetarian)` | 2 |
SELECT
COUNT(*) AS total,
SUM(CASE WHEN is_vegetarian THEN 1 ELSE 0 END) AS veg_count
FROM restaurants;
Percentage = 100.0 * filtered_count / total_count
The percentage formula combines the two aggregates above. Two traps: integer division (force float with 100.0) and divide-by-zero (wrap denominator in NULLIF(COUNT(*), 0)). The invariant: percentage = 100.0 * matching_count / total_count with a NULLIF guard.
- `100.0 * SUM(...) / COUNT(*)` — the float literal propagates float division.
- `NULLIF(COUNT(*), 0)` — converts zero to NULL so the division returns NULL silently.
- `ROUND(..., 2)` — clamps display precision and removes float noise.
- Worked example: three restaurants, two vegetarian → `100.0 * 2 / 3 = 66.67%`.
SELECT
ROUND(
100.0 * SUM(CASE WHEN is_vegetarian THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0),
2
) AS veg_percentage
FROM restaurants;
JOIN to a dimension before aggregating
Real percentage queries pull attributes from dimension tables: fact has restaurant_id; dim has is_vegetarian, cuisine, city. Join fact to dimension first, then aggregate. The invariant: the grain of the result is determined by the GROUP BY clause, not by the join cardinality—but watch fan-out from many-rows-per-key dimensions.
- Three-way join — fact + dim(city) + dim(dietary), then `GROUP BY c.city`.
- Fan-out defense — pre-aggregate the dim to one row per key, or use `COUNT(DISTINCT restaurant_id)`.
- Worked example: two cities, three restaurants; per-city percentage after the join + group.
| city | total | veg_count | veg_percentage |
|---|---|---|---|
| Denver | 2 | 1 | 50.00 |
| Seattle | 1 | 1 | 100.00 |
SELECT
c.city,
ROUND(
100.0 * SUM(CASE WHEN d.is_vegetarian THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0),
2
) AS veg_percentage
FROM restaurants r
JOIN cities c ON c.city_id = r.city_id
JOIN dietary d ON d.restaurant_id = r.restaurant_id
GROUP BY c.city
ORDER BY veg_percentage DESC;
Common beginner mistakes
- Integer division: `2 / 3 = 0` instead of 0.6667. Multiply by `100.0` (a float) to fix.
- Forgetting `NULLIF(denominator, 0)` — division by zero crashes the query.
- Using `COUNT(*)` after a fan-out join — overcounts; use `COUNT(DISTINCT ...)` instead.
- Computing the numerator and denominator in two separate queries — slower, more verbose.
- Not casting the result to a fixed precision — percentages display with float noise (e.g. 66.66666666666667).
SQL interview question on percentage with join
Tables restaurants(restaurant_id, city_id), cities(city_id, city), dietary(restaurant_id, is_vegetarian BOOLEAN). Return per-city percentage of vegetarian-only restaurants. Round to 2 decimals; sort by percentage descending.
Solution using conditional aggregation, NULLIF guard, and three-way join
SELECT
c.city,
ROUND(
100.0 * SUM(CASE WHEN d.is_vegetarian THEN 1 ELSE 0 END) / NULLIF(COUNT(*), 0),
2
) AS veg_percentage
FROM restaurants r
JOIN cities c ON c.city_id = r.city_id
JOIN dietary d ON d.restaurant_id = r.restaurant_id
GROUP BY c.city
ORDER BY veg_percentage DESC, c.city ASC;
Step-by-step trace (input: 3 restaurants across 2 cities):
restaurants(restaurant_id, city_id):
| restaurant_id | city_id |
|---|---|
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
cities(city_id, city):
| city_id | city |
|---|---|
| 1 | Denver |
| 2 | Seattle |
dietary(restaurant_id, is_vegetarian):
| restaurant_id | is_vegetarian |
|---|---|
| 1 | TRUE |
| 2 | FALSE |
| 3 | TRUE |
- Inner three-way join — `restaurants r` × `cities c` on `city_id` × `dietary d` on `restaurant_id`. Result: 3 joined rows, each with a city name + `is_vegetarian` flag.
- `GROUP BY c.city` — collapses to one row per city. Denver has 2 grouped rows; Seattle has 1.
- Per-group `COUNT(*)` — Denver: 2; Seattle: 1.
- Per-group conditional sum — `SUM(CASE WHEN d.is_vegetarian THEN 1 ELSE 0 END)`. Denver: TRUE + FALSE → `1`. Seattle: TRUE → `1`.
- Percentage formula per group — Denver: `100.0 * 1 / NULLIF(2, 0) = 50.0`. Seattle: `100.0 * 1 / NULLIF(1, 0) = 100.0`.
- `ROUND(..., 2)` — Denver → `50.00`; Seattle → `100.00`.
- `ORDER BY veg_percentage DESC, c.city ASC` — Seattle (100) first, Denver (50) second; the secondary key resolves any ties.
Output:
| city | veg_percentage |
|---|---|
| Seattle | 100.00 |
| Denver | 50.00 |
Why this works — concept by concept:
- Three-way join attaches dimensions — `r ⨝ c` provides the city label, `r ⨝ d` provides the vegetarian flag; the fact table `restaurants` defines the grain (one row per restaurant) before the group.
- Conditional aggregation as numerator — `SUM(CASE WHEN d.is_vegetarian THEN 1 ELSE 0 END)` counts only vegetarian rows in a single pass alongside the denominator — no separate query.
- `COUNT(*)` as total denominator — counts all joined rows in the group; pairs naturally with the numerator over the same group.
- Float-literal multiplier — `100.0 *` (rather than `100 *`) forces float division so `1 / 2` evaluates as `0.5`, not integer `0`.
- `NULLIF(COUNT(*), 0)` divide-by-zero guard — converts a zero-row group into `NULL`, which makes the division yield `NULL` instead of crashing.
- `ROUND(..., 2)` precision clamp — strips float noise (e.g. `66.66666666666667`) so the result reads `66.67`.
- Deterministic ordering — `ORDER BY veg_percentage DESC, c.city ASC` sorts top-down, with the city name as a stable tie-breaker.
- Cost — one scan per table, hash join on the small dimensions, hash aggregate by city; net O(N) where `N = |restaurants|`.
5. SQL Self-Joins with Date Functions for Roundtrip Detection
Self-joins with date filters in SQL for data engineering
Roundtrip detection is the canonical self-join interview pattern with a date twist. The mental model: a roundtrip = two flights where the second's origin and destination are swapped versus the first, and the second's date is within the return window. Self-join flights to itself with two aliases (outbound o, inbound i), join on o.dest = i.origin AND o.origin = i.dest, and filter on i.date within N days of o.date.
Self-join basics: alias the same table twice
A self-join uses one physical table as two logical references in the same query. Aliases are required to disambiguate columns; the ON clause defines the relationship. The invariant: aliasing makes intent explicit—state the relationship out loud: "outbound's destination equals inbound's origin and vice versa."
- `flights AS o`, `flights AS i` — two logical aliases on one physical table.
- The `ON` clause carries all relationship constraints; aliases alone do nothing.
- Worked example: two flights `(101, SFO→JFK)` and `(102, JFK→SFO)`.
| flight_id | origin | dest | date |
|---|---|---|---|
| 101 | SFO | JFK | 2026-04-01 |
| 102 | JFK | SFO | 2026-04-05 |
Self-join with o.dest = i.origin AND o.origin = i.dest returns the pair (101, 102).
SELECT o.flight_id AS outbound, i.flight_id AS inbound
FROM flights o
JOIN flights i ON o.dest = i.origin AND o.origin = i.dest;
Roundtrip pattern: origin↔dest swap
The roundtrip-pattern join condition is the swap: o.dest = i.origin AND o.origin = i.dest. Add o.flight_id != i.flight_id to defend against degenerate self-pairings. The invariant: roundtrip = origin/destination swap on the second flight with a self-pair guard.
- Two equality constraints, AND-joined — both directions must match.
- `o.flight_id != i.flight_id` — prevents pairing a row with itself (e.g. a circular tour).
- Multi-leg trips (A→B→C, C→B→A) require recursive CTEs — out of scope for the simple two-alias form.
- Worked example: same two flights; the swap condition matches; result = 1 pair.
SELECT o.flight_id AS outbound, i.flight_id AS inbound
FROM flights o
JOIN flights i
ON o.dest = i.origin
AND o.origin = i.dest
AND o.flight_id != i.flight_id;
Date filter with INTERVAL for return within N days
A real roundtrip has a date constraint: the inbound flight returns within N days of the outbound. PostgreSQL interval arithmetic (o.date + INTERVAL '7 days') makes this a one-liner; the choice of BETWEEN vs half-open >= / < controls boundary semantics. The invariant: interval arithmetic + date filter constrains the return window.
- `BETWEEN` is inclusive on both ends — use it only if both ends are intended.
- Half-open `>= / <` — safer default for "within 7 days" semantics; matches partition arithmetic elsewhere.
- Timezone caveat — `INTERVAL '7 days'` on `TIMESTAMP WITH TIME ZONE` is `7 * 86400` seconds, not always exactly 7 calendar days under DST. Pure `DATE` columns are safe.
- Worked example: flights 101 (2026-04-01) and 102 (2026-04-05) → 4 days apart, inside a 7-day window → pair `(101, 102)` returned.
SELECT o.flight_id AS outbound, i.flight_id AS inbound, o.date AS depart_date, i.date AS return_date
FROM flights o
JOIN flights i
ON o.dest = i.origin
AND o.origin = i.dest
AND o.flight_id != i.flight_id
WHERE i.date >= o.date
AND i.date < o.date + INTERVAL '7 days';
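For a quick local sanity check, the same query can be run through Python's stdlib sqlite3. SQLite has no `INTERVAL`, so `date(o.date, '+7 days')` stands in for the PostgreSQL arithmetic; this is a dialect translation, assuming ISO-8601 date strings (which compare correctly as text):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (flight_id INTEGER, origin TEXT, dest TEXT, date TEXT)"
)
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?, ?)",
    [
        (101, "SFO", "JFK", "2026-04-01"),
        (102, "JFK", "SFO", "2026-04-05"),
        (103, "LAX", "SEA", "2026-04-02"),
        (104, "SEA", "LAX", "2026-04-15"),
    ],
)

rows = conn.execute("""
    SELECT o.flight_id AS outbound, i.flight_id AS inbound
    FROM flights o
    JOIN flights i
      ON o.dest = i.origin
     AND o.origin = i.dest
     AND o.flight_id != i.flight_id
    WHERE i.date >= o.date
      AND i.date < date(o.date, '+7 days')   -- SQLite's interval arithmetic
    ORDER BY o.flight_id
""").fetchall()
print(rows)  # [(101, 102)] -- the 103/104 pair returns 13 days out, so it is dropped
```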
Common beginner mistakes
- Forgetting to alias — `SELECT origin FROM flights JOIN flights ON ...` is ambiguous.
- Joining with `o.origin = i.origin AND o.dest = i.dest` (same direction) — gives same-leg pairs, not roundtrips.
- Forgetting `o.flight_id != i.flight_id` and pairing flights with themselves.
- Using `BETWEEN` when half-open `>= / <` is the correct convention — off-by-one boundary inclusion.
- Computing date arithmetic in the application layer (Python), missing the SQL `INTERVAL` shortcut.
SQL interview question on roundtrip flights
Table flights(flight_id, origin, dest, date). Return all (outbound, inbound) flight-id pairs where the inbound trip is the reverse of the outbound and returns within 7 days. Sort by outbound ascending.
Solution using a self-join with origin↔dest swap and date filter
SELECT
o.flight_id AS outbound,
i.flight_id AS inbound,
o.date AS depart_date,
i.date AS return_date
FROM flights o
JOIN flights i
ON o.dest = i.origin
AND o.origin = i.dest
AND o.flight_id != i.flight_id
WHERE i.date >= o.date
AND i.date < o.date + INTERVAL '7 days'
ORDER BY o.flight_id;
Step-by-step trace (input: flights table below):
| flight_id | origin | dest | date |
|---|---|---|---|
| 101 | SFO | JFK | 2026-04-01 |
| 102 | JFK | SFO | 2026-04-05 |
| 103 | LAX | SEA | 2026-04-02 |
| 104 | SEA | LAX | 2026-04-15 |
- Alias the table twice — `flights o` (outbound) and `flights i` (inbound) reference the same physical rows under two logical names.
- Swap-condition match — for each `(o, i)` candidate, require `o.dest = i.origin` AND `o.origin = i.dest`. Pair `(101, 102)`: `o.dest = JFK = i.origin` and `o.origin = SFO = i.dest` → match. Pair `(103, 104)`: `o.dest = SEA = i.origin` and `o.origin = LAX = i.dest` → match.
- Self-pair guard — `o.flight_id != i.flight_id` ensures we don't pair flight 101 with itself; both candidate pairs have distinct ids → keep.
- Date filter, pair `(101, 102)` — `i.date = 2026-04-05 >= o.date = 2026-04-01` and `2026-04-05 < 2026-04-01 + 7 days = 2026-04-08` → both true, keep.
- Date filter, pair `(103, 104)` — `i.date = 2026-04-15 >= o.date = 2026-04-02` is true, but `2026-04-15 < 2026-04-02 + 7 days = 2026-04-09` is false → drop.
- `ORDER BY o.flight_id` — sorts the surviving pair(s) by outbound id ascending; `(101, 102)` is the only result.
Output:
| outbound | inbound | depart_date | return_date |
|---|---|---|---|
| 101 | 102 | 2026-04-01 | 2026-04-05 |
Why this works — concept by concept:
- Self-join with two aliases — `flights o` and `flights i` give the same physical table two logical roles; aliases are required to disambiguate columns and to express the relationship.
- Origin↔destination swap — the equality pair `o.dest = i.origin AND o.origin = i.dest` encodes "the inbound flight is the reverse of the outbound" — purely structural, no domain knowledge needed.
- Self-pair guard — `o.flight_id != i.flight_id` blocks degenerate pairings of a row with itself (which would otherwise match for circular routes where `origin = dest`).
- Half-open date window — `i.date >= o.date AND i.date < o.date + INTERVAL '7 days'` matches the canonical "within 7 days" semantics — inclusive lower bound, exclusive upper bound — safe from off-by-one boundary bugs.
- `INTERVAL` arithmetic — keeps the date math in SQL rather than the application layer; one expression that is both portable and readable.
- Deterministic ordering — `ORDER BY o.flight_id` produces stable output, important for tests and downstream diffs.
- Cost — index-supported hash join on `(origin, dest)` plus a filter pass; roughly O(N + M), where N and M are the outbound and inbound candidate counts.
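As a sanity check, the whole query can be run end to end with Python's built-in sqlite3 module. A sketch, with one translation assumption: dates are stored as ISO-8601 text, and SQLite's `date(..., '+7 days')` modifier stands in for the Postgres `INTERVAL` arithmetic.

```python
import sqlite3

# In-memory table mirroring the flights sample from the trace.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE flights (flight_id INTEGER, origin TEXT, dest TEXT, date TEXT);
INSERT INTO flights VALUES
  (101, 'SFO', 'JFK', '2026-04-01'),
  (102, 'JFK', 'SFO', '2026-04-05'),
  (103, 'LAX', 'SEA', '2026-04-02'),
  (104, 'SEA', 'LAX', '2026-04-15');
""")

# Same self-join; date(o.date, '+7 days') replaces o.date + INTERVAL '7 days'.
rows = con.execute("""
SELECT o.flight_id, i.flight_id, o.date, i.date
FROM flights o
JOIN flights i
  ON o.dest = i.origin
 AND o.origin = i.dest
 AND o.flight_id != i.flight_id
WHERE i.date >= o.date
  AND i.date < date(o.date, '+7 days')
ORDER BY o.flight_id;
""").fetchall()

print(rows)  # [(101, 102, '2026-04-01', '2026-04-05')]
```

Note that the reverse candidate `(102, 101)` is eliminated by the date filter, not the swap condition — its "return" would predate its "departure."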
6. SQL Window-Function Ranking for Top-K per Group
Top-K-per-group with window ranking in SQL for data engineering
"For each group, return the top-K rows by some metric" is the most-asked SQL window-function pattern. The mental model: PARTITION BY resets the ranking at each group; ORDER BY metric DESC ranks within group top-first; CTE wrapper + WHERE rn <= K keeps the top K. Three independent pieces, one window expression.
ROW_NUMBER() vs RANK() for top-K
Three ranking functions exist; the choice depends on tie semantics. The decision rule: ROW_NUMBER for "exactly K rows," RANK for "tied for first," DENSE_RANK for "no gaps". Default to ROW_NUMBER with a deterministic tie-break key (e.g. product_id ASC).
- `ROW_NUMBER()` — sequential `1, 2, 3, ...` even on ties; deterministic with a tie-break.
- `RANK()` — ties share rank, next rank skips (`1, 1, 3`).
- `DENSE_RANK()` — ties share rank, no gap (`1, 1, 2`).
- Worked example: shop A has products X (100), Y (80), Z (80), W (50). `ROW_NUMBER` returns X, Y, Z for top-3; `RANK` also returns X, Y, Z (Y and Z tied at rank 2).
| shop | product | units_sold | row_number | rank |
|---|---|---|---|---|
| A | X | 100 | 1 | 1 |
| A | Y | 80 | 2 | 2 |
| A | Z | 80 | 3 | 2 |
| A | W | 50 | 4 | 4 |
SELECT
shop_id, product_id, units_sold,
ROW_NUMBER() OVER (PARTITION BY shop_id ORDER BY units_sold DESC, product_id ASC) AS rn,
RANK() OVER (PARTITION BY shop_id ORDER BY units_sold DESC) AS rnk
FROM sales;
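The tie semantics above are easy to verify with Python's sqlite3 module, which ships window functions as of SQLite 3.25. A minimal sketch over the shop-A sample:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (shop_id TEXT, product_id TEXT, units_sold INTEGER);
INSERT INTO sales VALUES ('A','X',100), ('A','Y',80), ('A','Z',80), ('A','W',50);
""")

# rn uses the deterministic tie-break; rnk deliberately omits it to expose RANK's tie/skip behavior.
rows = con.execute("""
SELECT product_id,
       ROW_NUMBER() OVER (PARTITION BY shop_id
                          ORDER BY units_sold DESC, product_id ASC) AS rn,
       RANK()       OVER (PARTITION BY shop_id
                          ORDER BY units_sold DESC) AS rnk
FROM sales
ORDER BY rn;
""").fetchall()

print(rows)  # [('X', 1, 1), ('Y', 2, 2), ('Z', 3, 2), ('W', 4, 4)]
```

Y and Z both carry rank 2, and W jumps to 4 — exactly the `1, 2, 2, 4` skip the table illustrates.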
PARTITION BY shop_id ORDER BY units_sold DESC
PARTITION BY is the per-group reset; each group gets its own ranking sequence starting at 1. ORDER BY units_sold DESC sorts within group, top-first. Always add a deterministic tie-break key—interviewers grade whether you noticed.
- `PARTITION BY shop_id` — resets numbering at each shop (or week, region, category).
- `ORDER BY units_sold DESC, product_id ASC` — primary sort + secondary tie-break.
- Worked example: two shops, the window resets at the shop boundary.
| shop | product | rn |
|---|---|---|
| A | X | 1 |
| A | Y | 2 |
| A | Z | 3 |
| A | W | 4 |
| B | M | 1 |
| B | N | 2 |
| B | O | 3 |
Filtering top-K via CTE wrapper
Standard SQL evaluates `WHERE` before window functions, so you can't filter on `rn` directly in the same query. Wrap the windowed query in a CTE and filter outside it. The invariant: anything you compute with `OVER()` needs a CTE (or subquery) wrapper before you can filter on it.
- CTE pattern — portable across PostgreSQL, MySQL 8+, SQL Server, Oracle.
- `QUALIFY` clause — Snowflake / BigQuery / Teradata shorthand; not portable to PostgreSQL.
- Cost — O(N log N) for the partition-aware sort, O(N) for the filter; net O(N log N).
- Worked example: filtering with `WHERE rn <= 3` keeps X, Y, Z for shop A and M, N, O for shop B.
WITH ranked AS (
SELECT
shop_id, product_id, units_sold,
ROW_NUMBER() OVER (PARTITION BY shop_id ORDER BY units_sold DESC, product_id ASC) AS rn
FROM sales
)
SELECT shop_id, product_id, units_sold
FROM ranked
WHERE rn <= 3
ORDER BY shop_id, rn;
Common beginner mistakes
- Writing `WHERE rn <= 3` directly on a window column — evaluation order makes it invalid.
- Using `RANK` when the prompt wants exactly K rows and ties exist.
- Forgetting the tie-break secondary key — non-deterministic ordering for ties.
- Forgetting `PARTITION BY` and ranking globally instead of per-shop.
- Filtering with a correlated subquery instead of windows — O(N²) instead of O(N log N).
See more window-function problems →
SQL interview question on top-K ranking
Table sales(shop_id, product_id, units_sold). Return the top 3 products by units_sold for each shop, ties broken by product_id ascending. Sort output by shop_id, then rn.
Solution using ROW_NUMBER partitioned by shop
WITH ranked AS (
SELECT
shop_id,
product_id,
units_sold,
ROW_NUMBER() OVER (
PARTITION BY shop_id
ORDER BY units_sold DESC, product_id ASC
) AS rn
FROM sales
)
SELECT shop_id, product_id, units_sold
FROM ranked
WHERE rn <= 3
ORDER BY shop_id, rn;
Step-by-step trace (input: sales table below):
| shop_id | product_id | units_sold |
|---|---|---|
| A | X | 100 |
| A | Y | 80 |
| A | Z | 80 |
| A | W | 50 |
| B | M | 60 |
| B | N | 60 |
| B | O | 40 |
- Build the `ranked` CTE — compute `ROW_NUMBER()` per row using `PARTITION BY shop_id ORDER BY units_sold DESC, product_id ASC`.
- Partition reset at each shop — windowing starts fresh at shop A and again at shop B; rankings are independent per partition.
- Apply the order key in shop A — `units_sold DESC` puts X (100) first; Y and Z tie at 80, broken by `product_id ASC` (Y before Z); W (50) last → `rn = 1, 2, 3, 4`.
- Apply the order key in shop B — M and N tie at 60, broken by `product_id` (M before N); O (40) last → `rn = 1, 2, 3`.
- Outer filter `WHERE rn <= 3` — keeps shop A rows X (1), Y (2), Z (3); drops W (4). Keeps all three shop B rows (rn 1–3).
- `ORDER BY shop_id, rn` — sorts shop A's three rows first (in `rn` order), then shop B's three rows.
Output:
| shop_id | product_id | units_sold |
|---|---|---|
| A | X | 100 |
| A | Y | 80 |
| A | Z | 80 |
| B | M | 60 |
| B | N | 60 |
| B | O | 40 |
Why this works — concept by concept:
- `PARTITION BY shop_id` for per-group reset — the window restarts numbering at each shop, so the same `ROW_NUMBER()` expression produces independent rankings per group; no groupwise subqueries needed.
- `ORDER BY units_sold DESC, product_id ASC` — the primary key sorts top-first; the secondary `product_id ASC` gives a stable, deterministic tie-break — exactly what interviewers grade for.
- `ROW_NUMBER` over `RANK` — `ROW_NUMBER` always emits `1, 2, 3, ...` (no ties, no skips), keeping exactly K rows per shop; `RANK` could emit `1, 2, 2, 4` and break the "top-3 only" guarantee.
- CTE wrapper for window filtering — standard SQL evaluates `WHERE` before `OVER()`, so `WHERE rn <= 3` is invalid in the same query as the window expression; wrapping in `WITH ranked AS (...)` lets the outer `WHERE` reference `rn`.
- Outer `ORDER BY shop_id, rn` — produces clean, deterministic output ordering for downstream consumption (tests, dashboards).
- Cost — each partition of size P sorts in O(P log P), so the whole pass is O(N log N) worst case; the outer filter is O(N) — net O(N log N).
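The full solution can be checked against the sample data with Python's sqlite3 module (SQLite 3.25+ for window-function support); a sketch:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (shop_id TEXT, product_id TEXT, units_sold INTEGER);
INSERT INTO sales VALUES
  ('A','X',100), ('A','Y',80), ('A','Z',80), ('A','W',50),
  ('B','M',60),  ('B','N',60), ('B','O',40);
""")

# Identical CTE + outer-filter pattern from the solution above.
rows = con.execute("""
WITH ranked AS (
  SELECT shop_id, product_id, units_sold,
         ROW_NUMBER() OVER (
           PARTITION BY shop_id
           ORDER BY units_sold DESC, product_id ASC
         ) AS rn
  FROM sales
)
SELECT shop_id, product_id, units_sold
FROM ranked
WHERE rn <= 3
ORDER BY shop_id, rn;
""").fetchall()

print(rows)
# [('A', 'X', 100), ('A', 'Y', 80), ('A', 'Z', 80),
#  ('B', 'M', 60), ('B', 'N', 60), ('B', 'O', 40)]
```

Shop A's fourth product W is dropped and shop B keeps all three rows, matching the expected output table.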
7. SQL Date Functions for KPI Analysis
Date functions for KPI analysis in SQL for data engineering
KPI analysis is the standard end-of-quarter reporting workload: compute daily, weekly, monthly metrics across a date range. The mental model: bucket timestamps with DATE_TRUNC; group by the bucket; compute multiple metrics per group with conditional aggregation. One query returns all KPIs—no separate query per metric.
DATE_TRUNC for KPI bucketing (daily / weekly / monthly)
DATE_TRUNC(unit, ts) truncates a timestamp to the start of the named unit ('day', 'week', 'month', 'quarter'). The invariant: DATE_TRUNC produces a left-edge timestamp for the bucket.
- Same pattern across grains — only the unit string changes; aggregation logic stays the same.
- Week start — PostgreSQL returns Monday (ISO 8601). Shift to a Sunday start with `DATE_TRUNC('week', ts + INTERVAL '1 day') - INTERVAL '1 day'`.
- Worked example: three rides bucketed to day and hour.
| ride_id | ride_start_ts | DATE_TRUNC('day', ts) | DATE_TRUNC('hour', ts) |
|---|---|---|---|
| 1 | 2026-04-28 09:15 | 2026-04-28 00:00 | 2026-04-28 09:00 |
| 2 | 2026-04-28 11:30 | 2026-04-28 00:00 | 2026-04-28 11:00 |
| 3 | 2026-04-29 02:00 | 2026-04-29 00:00 | 2026-04-29 02:00 |
SELECT
DATE_TRUNC('day', ride_start_ts) AS day,
COUNT(*) AS rides
FROM rides
GROUP BY day
ORDER BY day;
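SQLite has no `DATE_TRUNC`, so a quick way to internalize the left-edge invariant is a plain-Python stand-in. The `date_trunc` helper below is illustrative, not a library function:

```python
from datetime import datetime

def date_trunc(unit: str, ts: datetime) -> datetime:
    """Hypothetical mini-DATE_TRUNC: zero out every field below `unit`,
    yielding the left edge of the bucket, same as the SQL version."""
    if unit == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    raise ValueError(f"unsupported unit: {unit}")

ts = datetime(2026, 4, 28, 9, 15)
print(date_trunc("day", ts))   # 2026-04-28 00:00:00
print(date_trunc("hour", ts))  # 2026-04-28 09:00:00
```

The two results match the `DATE_TRUNC('day', ts)` and `DATE_TRUNC('hour', ts)` columns for ride 1 in the table above.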
EXTRACT and EPOCH for duration metrics
Duration is a difference between two timestamps; subtraction returns an interval. EXTRACT(EPOCH FROM interval) converts the interval to total seconds as a numeric. The invariant: EXTRACT(EPOCH FROM interval) converts duration to seconds—divide by 60 / 3600 / 86400 for minutes / hours / days.
- `(end_ts - start_ts)` — an interval; not directly numeric.
- `EXTRACT(EPOCH FROM ...)` — total seconds as `numeric`.
- `AVG(EXTRACT(EPOCH FROM (end - start)))` — aggregates duration across rides.
- Worked example: average duration per day.
| day | avg_duration_minutes |
|---|---|
| 2026-04-28 | 12.5 |
| 2026-04-29 | 8.0 |
SELECT
DATE_TRUNC('day', ride_start_ts) AS day,
AVG(EXTRACT(EPOCH FROM (ride_end_ts - ride_start_ts)) / 60) AS avg_duration_min
FROM rides
GROUP BY day
ORDER BY day;
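The same interval-to-seconds conversion has a direct Python analogue: timestamp subtraction yields a `timedelta`, and `total_seconds()` plays the role of `EXTRACT(EPOCH FROM ...)`. A one-screen sketch of a single 12.5-minute ride:

```python
from datetime import datetime

start = datetime(2026, 4, 28, 9, 15)
end = datetime(2026, 4, 28, 9, 27, 30)

# timedelta.total_seconds() == EXTRACT(EPOCH FROM (end - start))
seconds = (end - start).total_seconds()
minutes = seconds / 60  # divide by 60 / 3600 / 86400 for min / hr / days

print(seconds, minutes)  # 750.0 12.5
```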
Composing multiple KPIs in one query with conditional aggregates
Real KPI dashboards need many metrics in one shot: total rides, average duration, peak-hour share, weekend split. Combine DATE_TRUNC bucketing with conditional aggregation in a single SELECT. The invariant: composing aggregates in a single SELECT reads as a metric definition—adding a new KPI is one new aggregate expression.
- `COUNT(*)` — total rides.
- `AVG(EXTRACT(EPOCH FROM ...))` — average duration.
- `SUM(CASE WHEN EXTRACT(HOUR FROM ts) BETWEEN 7 AND 9 THEN 1 ELSE 0 END)` — morning-peak count.
- `SUM(CASE WHEN EXTRACT(DOW FROM ts) IN (0, 6) THEN 1 ELSE 0 END)` — weekend count (Sunday=0, Saturday=6).
- Worked example: two days, multi-metric output.
| day | total_rides | avg_duration_min | morning_peak | weekend_rides |
|---|---|---|---|---|
| 2026-04-28 (Tue) | 100 | 12.5 | 25 | 0 |
| 2026-05-02 (Sat) | 150 | 8.0 | 30 | 150 |
SELECT
DATE_TRUNC('day', ride_start_ts) AS day,
COUNT(*) AS total_rides,
AVG(EXTRACT(EPOCH FROM (ride_end_ts - ride_start_ts)) / 60) AS avg_duration_min,
SUM(CASE WHEN EXTRACT(HOUR FROM ride_start_ts) BETWEEN 7 AND 9 THEN 1 ELSE 0 END) AS morning_peak_rides,
SUM(CASE WHEN EXTRACT(DOW FROM ride_start_ts) IN (0, 6) THEN 1 ELSE 0 END) AS weekend_rides
FROM rides
GROUP BY day
ORDER BY day;
Common beginner mistakes
- Using `BETWEEN '2026-04-01' AND '2026-04-30'` for "all of April" on a timestamp column — the upper bound casts to midnight on April 30, silently dropping the rest of that day; use `>= '2026-04-01' AND < '2026-05-01'` instead.
- Computing each KPI in a separate query — slow, redundant, hard to maintain.
- Forgetting `GROUP BY` after aggregates — syntax error.
- `EXTRACT(DOW FROM ts)` is 0-indexed (Sunday = 0) — off-by-one in weekend filters.
- Using `INTERVAL '7 day'` instead of `INTERVAL '7 days'` — both work in PostgreSQL, but the plural form is canonical.
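The April-boundary trap is easy to reproduce with sqlite3, where lexicographic comparison of ISO-8601 text timestamps mirrors Postgres casting a bare date literal to midnight:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rides (ts TEXT)")
con.executemany("INSERT INTO rides VALUES (?)",
                [("2026-04-15 09:00:00",), ("2026-04-30 10:00:00",)])

# BETWEEN's upper bound is the bare date '2026-04-30', i.e. midnight --
# the 10:00 ride on April 30 compares greater and is silently dropped.
between = con.execute(
    "SELECT COUNT(*) FROM rides WHERE ts BETWEEN '2026-04-01' AND '2026-04-30'"
).fetchone()[0]

# Half-open bounds catch all of April, including the entire last day.
half_open = con.execute(
    "SELECT COUNT(*) FROM rides WHERE ts >= '2026-04-01' AND ts < '2026-05-01'"
).fetchone()[0]

print(between, half_open)  # 1 2
```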
SQL interview question on KPI analysis
Table rides(ride_id, ride_start_ts, ride_end_ts, fare). Return per-day KPIs: total rides, average ride duration in minutes, total fare, percentage of rides in morning peak (7–9 AM), and weekend ride count. Sort by day.
Solution using DATE_TRUNC + EXTRACT + conditional aggregation
SELECT
DATE_TRUNC('day', ride_start_ts) AS day,
COUNT(*) AS total_rides,
ROUND(
AVG(EXTRACT(EPOCH FROM (ride_end_ts - ride_start_ts)) / 60),
2
) AS avg_duration_min,
SUM(fare) AS total_fare,
ROUND(
100.0 * SUM(CASE WHEN EXTRACT(HOUR FROM ride_start_ts) BETWEEN 7 AND 9 THEN 1 ELSE 0 END)
/ NULLIF(COUNT(*), 0),
2
) AS morning_peak_pct,
SUM(CASE WHEN EXTRACT(DOW FROM ride_start_ts) IN (0, 6) THEN 1 ELSE 0 END) AS weekend_rides
FROM rides
GROUP BY day
ORDER BY day;
Step-by-step trace (input: rides table below — Tue 2026-04-28 and Sat 2026-05-02):
| ride_id | ride_start_ts | ride_end_ts | fare |
|---|---|---|---|
| 1 | 2026-04-28 08:00 | 2026-04-28 08:15 | 12.00 |
| 2 | 2026-04-28 12:00 | 2026-04-28 12:10 | 10.00 |
| 3 | 2026-05-02 09:00 | 2026-05-02 09:08 | 8.00 |
| 4 | 2026-05-02 14:00 | 2026-05-02 14:08 | 8.00 |
- Bucket by day — `DATE_TRUNC('day', ride_start_ts)` maps rides 1, 2 → `2026-04-28`; rides 3, 4 → `2026-05-02`.
- `GROUP BY day` — collapses to one row per day; the Tue group has 2 rides, the Sat group has 2 rides.
- `COUNT(*)` — Tue: 2; Sat: 2.
- Average duration — Tue: avg of 15 and 10 min = 12.5 min; Sat: avg of 8 and 8 min = 8.0 min. `EXTRACT(EPOCH FROM (end - start)) / 60` converts each interval to minutes.
- `SUM(fare)` — Tue: `12 + 10 = 22.00`; Sat: `8 + 8 = 16.00`.
- Morning-peak count — `EXTRACT(HOUR FROM ride_start_ts) BETWEEN 7 AND 9`. Tue ride 1 (08:00) qualifies, ride 2 (12:00) doesn't → count = 1. Sat ride 3 (09:00) qualifies, ride 4 (14:00) doesn't → count = 1.
- Morning-peak percentage — Tue: `100.0 * 1 / NULLIF(2, 0) = 50.00`. Sat: `100.0 * 1 / NULLIF(2, 0) = 50.00`. Rounded to 2 decimals.
- Weekend count — `EXTRACT(DOW FROM ride_start_ts) IN (0, 6)` (Sun=0, Sat=6). Tue (DOW=2): `0 + 0 = 0`. Sat (DOW=6): `1 + 1 = 2`.
- `ORDER BY day` — sorts ascending: Tue first, Sat second.
Output:
| day | total_rides | avg_duration_min | total_fare | morning_peak_pct | weekend_rides |
|---|---|---|---|---|---|
| 2026-04-28 | 2 | 12.50 | 22.00 | 50.00 | 0 |
| 2026-05-02 | 2 | 8.00 | 16.00 | 50.00 | 2 |
Why this works — concept by concept:
- `DATE_TRUNC('day', ts)` for bucketing — produces a left-edge timestamp per ride; `GROUP BY` on this expression collapses all rides on the same calendar day into one group.
- Composing aggregates in one `SELECT` — every KPI is a single aggregate expression evaluated in the same pass; adding a new metric is one extra column, not a new query.
- `EXTRACT(EPOCH FROM interval)` for duration — converts the `interval` returned by `(end - start)` to total seconds as a numeric; dividing by 60 yields minutes, the readable display unit.
- Conditional aggregation for morning peak — `SUM(CASE WHEN EXTRACT(HOUR ...) BETWEEN 7 AND 9 THEN 1 ELSE 0 END)` counts only the qualifying rides in the same group pass — no separate query needed.
- `100.0 *` float coercion — forces float division so `1 / 2` produces `0.5`, not integer `0`; `NULLIF(COUNT(*), 0)` guards empty buckets against divide-by-zero.
- `EXTRACT(DOW FROM ts) IN (0, 6)` for weekends — Sunday is `0`, Saturday is `6` in PostgreSQL — the off-by-one trap interviewers ask about; one expression covers both weekend days.
- `ROUND(..., 2)` for clean display — strips float noise from the average and percentage outputs.
- Cost — one scan of `rides`, a hash aggregate keyed by truncated day, then a sort of the distinct days; O(N + K log K) where N is rides and K is distinct days.
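The whole trace can be replayed in plain Python as a sanity check. A sketch over the same four rides (note one portability trap called out in a comment: Python's `weekday()` numbers days Mon=0..Sun=6, unlike Postgres `DOW`):

```python
from collections import defaultdict
from datetime import datetime

# (ride_id, start, end, fare) -- same four rides as the trace above.
rides = [
    (1, datetime(2026, 4, 28, 8, 0), datetime(2026, 4, 28, 8, 15), 12.00),
    (2, datetime(2026, 4, 28, 12, 0), datetime(2026, 4, 28, 12, 10), 10.00),
    (3, datetime(2026, 5, 2, 9, 0), datetime(2026, 5, 2, 9, 8), 8.00),
    (4, datetime(2026, 5, 2, 14, 0), datetime(2026, 5, 2, 14, 8), 8.00),
]

by_day = defaultdict(list)            # DATE_TRUNC('day', ...) + GROUP BY
for ride in rides:
    by_day[ride[1].date()].append(ride)

kpis = {}
for day, group in sorted(by_day.items()):    # ORDER BY day
    durations = [(e - s).total_seconds() / 60 for _, s, e, _ in group]
    peak = sum(1 for _, s, _, _ in group if 7 <= s.hour <= 9)
    # Caution: Python weekday() is Mon=0..Sun=6, not Postgres DOW (Sun=0).
    weekend = sum(1 for _, s, _, _ in group if s.weekday() >= 5)
    kpis[day.isoformat()] = (
        len(group),                                   # total_rides
        round(sum(durations) / len(durations), 2),    # avg_duration_min
        round(sum(f for *_, f in group), 2),          # total_fare
        round(100.0 * peak / len(group), 2),          # morning_peak_pct
        weekend,                                      # weekend_rides
    )

print(kpis)
# {'2026-04-28': (2, 12.5, 22.0, 50.0, 0), '2026-05-02': (2, 8.0, 16.0, 50.0, 2)}
```

Both rows match the expected output table: 50% morning-peak share on each day, and weekend rides only on Saturday.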
Tips to crack Oracle data engineering interviews
These are habits that move the needle in real Oracle loops—not a re-statement of the topics above.
Python preparation
Drill the three patterns in Oracle's Python set: stack for valid brackets, hash table for two-sum, dict / JSON for extraction. Stdlib fluency (json, dict.get, list as stack, dict as hash map) matters more than algorithm depth. Drill problems from the stack, hash table, and dictionary topic pages.
SQL preparation
Drill the four patterns: aggregation with joins for percentages, self-joins with date filters for roundtrips, window-function ranking for top-K, and date functions for KPIs. Memorize the 100.0 * filtered / total percentage formula, the origin↔dest swap for self-joins, the ROW_NUMBER OVER (PARTITION BY ... ORDER BY ... DESC) template, and DATE_TRUNC + EXTRACT for KPIs. The aggregation, self-join, ranking, and date functions topic pages cover the bulk.
Domain-agnostic framing
Oracle's prompts span four verticals (food, flights, retail, mobility). The interviewer is grading whether you map each framing to the right algorithm regardless of domain. State the mapping out loud: "this is conditional aggregation"; "this is a self-join with a date filter"; "this is top-K via window ranking"; "this is multi-KPI date-bucket aggregation."
Where to practice on PipeCode
| Skill lane | Practice path |
|---|---|
| Curated Oracle practice set | /explore/practice/company/oracle |
| Stack | /explore/practice/topic/stack |
| Hash table | /explore/practice/topic/hash-table |
| Dictionary | /explore/practice/topic/dictionary |
| Aggregation | /explore/practice/topic/aggregation |
| Self-join | /explore/practice/topic/self-join |
| Ranking | /explore/practice/topic/ranking |
| Date functions | /explore/practice/topic/date-functions |
| All practice topics | /explore/practice/topics |
| Interview courses | /explore/courses |
Communication under time pressure
State assumptions before typing: "I'll assume flights has unique flight_id"; "I'll assume timestamps are in a single timezone"; "I'll assume rides.fare is non-null." State invariants after key code blocks. State complexity: "this is O(N log N) for the partitioned sort + O(N) for the filter." Interviewers grade clear reasoning above silent-and-perfect.
Frequently Asked Questions
What is the Oracle data engineering interview process like?
The Oracle data engineering interview typically includes a phone screen (mostly Python warm-up around stack / hash table / dict patterns), one or two coding rounds focused on SQL aggregation, joins, and window functions, plus a system-design conversation around pipelines and data architectures, and behavioral interviews. The curated 7-problem Oracle practice set on PipeCode mirrors what you will see on the technical rounds.
What SQL topics does Oracle test for data engineers?
Oracle emphasizes conditional aggregation with joins (for percentages), self-joins with date filters (for relationship detection like roundtrips), window-function ranking (for top-K-per-group), and date functions (DATE_TRUNC + EXTRACT) for KPI analysis. Drill these on the aggregation, self-join, ranking, and date functions topic pages.
How important is Python for an Oracle data engineering interview?
Python is roughly half the technical interview at Oracle. The patterns are classics: stack for bracket validation, hash table for two-sum, dict / JSON for extraction. Stdlib fluency (list, dict, json) matters more than algorithm depth.
Are Oracle's data engineering questions about Oracle Database specifically?
No—despite the company name, the curated practice problems are dialect-agnostic SQL (typically PostgreSQL syntax) and standard Python. There's no Oracle-Database-specific PL/SQL or proprietary feature in the curated set. Prepare for PostgreSQL-flavor SQL; the patterns translate to Oracle Database SQL with minor syntax adjustments (e.g. NVL instead of COALESCE).
How hard are Oracle data engineering interview questions?
Oracle's curated set has 4 easy + 3 medium + 0 hard = the most fundamentals-friendly hub of the seven covered so far. No exotic algorithms (no Morris traversal, no DP, no advanced graph algos)—just classic interview patterns. Good entry hub for early-career candidates and quick refresher for senior candidates.
How many Oracle practice problems should I solve before the interview?
Aim for 20–30 problems spanning all seven topic clusters above—not 100 of the same kind. Solve every problem in the Oracle-tagged practice set, then back-fill weak areas using the topic pages linked throughout this guide.
Start practicing Oracle data engineering problems
Reading patterns is not the same as typing them under time pressure. PipeCode pairs company-tagged Oracle problems with tests, AI feedback, and a coding environment so you can drill the exact SQL and Python patterns Oracle asks—without the noise of generic algorithm prep that doesn't apply to this loop.
Pipecode.ai is Leetcode for Data Engineering.