DEV Community

Gowtham Potureddi

Intuit Data Engineering Interview Questions

Intuit data engineering interview questions balance SQL and Python with a fintech analytics edge: five primitives spanning both languages — RANK() OVER (ORDER BY AVG(amount) DESC) for window-function ranking over aggregates, JOIN + subquery salary IN (SELECT salary FROM employees GROUP BY salary HAVING COUNT(*) > 1) for same-salary employees, SQL regex auth_code ~ '^[0-9]+$' for numeric-only authorization codes, Python right-to-left scan for the largest odd-number substring, and collections.Counter plus a tuple sort key for country-count rollups. The framings are finance and SaaS — transactions, expenses, salaries, authorization codes, country counts — exactly the data shapes a TurboTax / QuickBooks / Mint engineer ships every week.

This guide walks through the five topic clusters Intuit actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 5-problem Intuit set (4 easy, 1 medium, no hard) — a fundamentals-and-classics hub that rewards window-function fluency, regex pattern recognition, and clean Python rollup primitives.



Top Intuit data engineering interview topics

From the Intuit data engineering practice set, the five numbered sections below follow this topic map (one row per H2):

| # | Topic (sections 1–5) | Why it shows up at Intuit |
|---|---|---|
| 1 | SQL window functions for ranking transaction types by average expense | Rank Transaction Types by Average Expense (MEDIUM) — RANK() OVER (ORDER BY AVG(amount) DESC) over a GROUP BY transaction_type aggregate. |
| 2 | SQL JOINs and subqueries for employees with the same salary | Employees With Same Salary — WHERE salary IN (SELECT salary FROM employees GROUP BY salary HAVING COUNT(*) > 1). |
| 3 | SQL regular expressions for numeric authorization codes | Numeric Authorization Code Transactions — WHERE auth_code ~ '^[0-9]+$'. |
| 4 | Python string manipulation for the largest odd-number substring | Largest Odd Number in String — right-to-left scan for the first odd digit; return the prefix s[:i+1]. |
| 5 | Python hash tables and sorting for country-count rollups | Count Countries — Counter(...) plus sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])). |

Fintech-analytics framing rule: Intuit's prompts span finance-domain SQL and Python — transaction-type expense rankings, salary integrity checks, authorization-code validation, digit-string parsing, country rollups. The interviewer is grading whether you map each business framing to the right primitive: ranked aggregates → window function over GROUP BY; integrity check → subquery + HAVING COUNT > 1; numeric-only → regex ^[0-9]+$; longest-odd-prefix → right-to-left scan; per-key counts → Counter. State the mapping out loud.


1. SQL Window Functions for Ranking Transaction Types by Average Expense

Window functions over aggregates for ranked rollups in SQL for data engineering

"Rank transaction types by their average expense" is the canonical window-function-over-aggregate interview prompt at Intuit. The mental model: GROUP BY transaction_type collapses to one row per type with AVG(amount); RANK() OVER (ORDER BY AVG(amount) DESC) then ranks those rows. The window function operates on the aggregated result set, not the raw rows — that's the structural distinction every candidate must articulate. The same shape powers any "rank entities by aggregated metric" query — top customers by revenue, top categories by margin, top regions by churn.

Pro tip: RANK, DENSE_RANK, and ROW_NUMBER all rank — but they differ on ties. Memorize the difference: RANK skips numbers (1, 1, 3); DENSE_RANK doesn't (1, 1, 2); ROW_NUMBER arbitrarily breaks ties (1, 2, 3 even on duplicates). For "rank with gaps," use RANK; for "rank without gaps," use DENSE_RANK; for "deterministic sequential," use ROW_NUMBER with a secondary ORDER BY tie-break.
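The tie behavior above is easy to verify locally. A minimal sketch using SQLite's window functions (SQLite 3.25+; the scores table and values are made up for illustration):

```python
# Verify RANK vs DENSE_RANK vs ROW_NUMBER tie behavior in SQLite.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (name TEXT, pts INT)")
con.executemany("INSERT INTO scores VALUES (?, ?)",
                [("a", 90), ("b", 90), ("c", 80)])

rows = con.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY pts DESC) AS rk,
           DENSE_RANK() OVER (ORDER BY pts DESC) AS drk,
           ROW_NUMBER() OVER (ORDER BY pts DESC, name ASC) AS rn
    FROM scores
    ORDER BY rn
""").fetchall()
print(rows)  # [('a', 1, 1, 1), ('b', 1, 1, 2), ('c', 3, 2, 3)]
```

The two tied rows share rank 1; RANK jumps to 3 while DENSE_RANK continues at 2, and ROW_NUMBER's name tie-break makes the sequence deterministic.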

GROUP BY transaction_type + AVG(amount) for per-type average

The aggregate invariant: GROUP BY transaction_type produces one row per type; AVG(amount) computes the mean within each group. The output is the canonical per-entity summary — the input to the window function in the next step.

  • GROUP BY transaction_type — one row per type.
  • AVG(amount) — mean amount in the group.
  • COUNT(*) — group size; useful for filtering thin groups via HAVING COUNT(*) >= N.
  • SUM(amount) — total instead of average; pick the metric the prompt asks for.

Worked example. Six transactions across three types.

| transaction_type | amount |
|---|---|
| groceries | 50 |
| groceries | 30 |
| rent | 1200 |
| rent | 1200 |
| dining | 80 |
| dining | 60 |

After GROUP BY transaction_type, AVG(amount):

| transaction_type | avg_amount |
|---|---|
| rent | 1200.00 |
| dining | 70.00 |
| groceries | 40.00 |

Worked-example solution.

SELECT transaction_type, AVG(amount) AS avg_amount
FROM transactions
GROUP BY transaction_type
ORDER BY avg_amount DESC;

Rule of thumb: pinning the aggregate step in a CTE before applying the window function makes the two-stage structure explicit to the interviewer, and it becomes mandatory the moment you need to filter on the rank, since window functions can't appear in WHERE.

RANK() OVER (ORDER BY AVG(amount) DESC) for windowed ranking

The window-rank invariant: RANK() OVER (ORDER BY AVG(amount) DESC) runs over the aggregated rows; it does not need a PARTITION BY if the entire result is one ranked list. The window function evaluates after the GROUP BY and produces one rank value per row in the aggregated output.

  • RANK() OVER (ORDER BY AVG(amount) DESC) — descending rank.
  • No PARTITION BY — single global ranking across all transaction types.
  • PARTITION BY region — one ranking per region (different question).
  • Tie-break inside ORDER BY — ORDER BY AVG(amount) DESC, transaction_type ASC for deterministic output on equal averages.

Worked example. Apply RANK() to the previous aggregate.

| transaction_type | avg_amount | rank |
|---|---|---|
| rent | 1200.00 | 1 |
| dining | 70.00 | 2 |
| groceries | 40.00 | 3 |

Worked-example solution.

SELECT
  transaction_type,
  AVG(amount) AS avg_amount,
  RANK() OVER (ORDER BY AVG(amount) DESC) AS rk
FROM transactions
GROUP BY transaction_type
ORDER BY rk;

Rule of thumb: a window function inside the same SELECT as a GROUP BY runs after the aggregation — you can reference AVG(amount) inside the window's ORDER BY without a CTE wrapper.

Tied averages: RANK skips, DENSE_RANK doesn't

The tie invariant: two types with equal averages share a rank; RANK skips the next number, DENSE_RANK doesn't. For "rank with gaps" (1, 1, 3) use RANK; for "rank without gaps" (1, 1, 2) use DENSE_RANK. The difference matters for paginated leaderboards and any consumer that expects contiguous rank numbers.

  • RANK() — ties share, then skip.
  • DENSE_RANK() — ties share, no skip.
  • ROW_NUMBER() — every row gets a unique rank, ties broken arbitrarily.
  • Always add a tie-break in ORDER BY to make ROW_NUMBER deterministic.

Worked example. Two types tie at 70.00.

| type | avg | RANK | DENSE_RANK | ROW_NUMBER |
|---|---|---|---|---|
| rent | 1200 | 1 | 1 | 1 |
| dining | 70 | 2 | 2 | 2 |
| takeout | 70 | 2 | 2 | 3 |
| groceries | 40 | 4 | 3 | 4 |

Worked-example solution.

SELECT
  transaction_type,
  AVG(amount) AS avg_amount,
  RANK()       OVER (ORDER BY AVG(amount) DESC) AS rk,
  DENSE_RANK() OVER (ORDER BY AVG(amount) DESC) AS drk,
  ROW_NUMBER() OVER (ORDER BY AVG(amount) DESC, transaction_type ASC) AS rn
FROM transactions
GROUP BY transaction_type
ORDER BY rk;

Rule of thumb: RANK is the default expected answer when the prompt says "rank" without qualification; mention DENSE_RANK and ROW_NUMBER as alternatives during your verbal walkthrough.

Common beginner mistakes

  • Trying WHERE RANK() OVER (...) = 1 — illegal; window functions can't appear in WHERE. Wrap in CTE first.
  • Forgetting DESC and getting ascending rank by default.
  • Confusing per-row rank with per-aggregate rank — clarify which one the prompt wants.
  • Skipping the tie-break and surprising the test harness with non-deterministic output.
  • Aggregating without GROUP BY and seeing only one row — every column not behind an aggregate must be in GROUP BY.
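The first mistake above has a standard fix: wrap the ranked query in a CTE, then filter in the outer SELECT. A runnable sketch in SQLite (the transactions table mirrors the article's schema; the rows are made up):

```python
# Window functions are illegal in WHERE; a CTE makes "keep rank 1" legal.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (transaction_type TEXT, amount REAL)")
con.executemany("INSERT INTO transactions VALUES (?, ?)",
                [("groceries", 50), ("groceries", 30),
                 ("rent", 1200), ("rent", 1200),
                 ("dining", 80), ("dining", 60)])

top = con.execute("""
    WITH ranked AS (
        SELECT transaction_type,
               AVG(amount) AS avg_amount,
               RANK() OVER (ORDER BY AVG(amount) DESC) AS rk
        FROM transactions
        GROUP BY transaction_type
    )
    SELECT transaction_type, avg_amount, rk
    FROM ranked
    WHERE rk = 1   -- legal here; illegal inside the CTE's own WHERE
""").fetchall()
print(top)  # [('rent', 1200.0, 1)]
```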

SQL Interview Question on Ranking Transaction Types by Average Expense

Table transactions(transaction_id INT, transaction_type TEXT, amount NUMERIC). Return one row per transaction_type with the average amount and the rank by that average descending. Break ties by alphabetical type ascending. Columns: transaction_type, avg_amount, rk.

Solution Using a window function over a GROUP BY aggregate

SELECT
  transaction_type,
  ROUND(AVG(amount), 2) AS avg_amount,
  RANK() OVER (ORDER BY AVG(amount) DESC) AS rk
FROM transactions
GROUP BY transaction_type
ORDER BY rk, transaction_type ASC;

Step-by-step trace (input: 8 rows across 3 types):

| transaction_id | transaction_type | amount |
|---|---|---|
| 1 | groceries | 50 |
| 2 | groceries | 30 |
| 3 | groceries | 40 |
| 4 | rent | 1200 |
| 5 | rent | 1200 |
| 6 | dining | 80 |
| 7 | dining | 60 |
| 8 | dining | 70 |

  1. Scan + group — GROUP BY transaction_type produces three groups: groceries (3 rows), rent (2 rows), dining (3 rows).
  2. Compute AVG(amount)
    • groceries: (50+30+40)/3 = 40.00
    • rent: (1200+1200)/2 = 1200.00
    • dining: (80+60+70)/3 = 70.00
  3. Apply RANK() OVER (ORDER BY AVG(amount) DESC)
    • rent (1200) → rk=1
    • dining (70) → rk=2
    • groceries (40) → rk=3
  4. ORDER BY rk, transaction_type ASC — primary by rank, secondary alphabetic on ties.

Output:

| transaction_type | avg_amount | rk |
|---|---|---|
| rent | 1200.00 | 1 |
| dining | 70.00 | 2 |
| groceries | 40.00 | 3 |

Why this works — concept by concept:

  • GROUP BY transaction_type — collapses raw transactions into one row per type, the canonical input shape for the window function.
  • AVG(amount) — the per-type metric; ROUND(..., 2) clamps the float for readable output.
  • RANK() OVER (ORDER BY AVG(amount) DESC) — runs after the GROUP BY; assigns rank 1 to the highest average, with ties sharing a rank and the next number skipped.
  • No PARTITION BY — there's a single global ranking across all transaction types; partitioning would create per-partition rankings.
  • Secondary tie-break in ORDER BY — transaction_type ASC makes the final output deterministic when two types share a rank.
  • Cost — one scan of transactions, one hash aggregate, one window pass over the aggregated rows → O(N) time, O(K) space where K is the number of distinct types.



2. SQL JOINs and Subqueries for Employees With the Same Salary

Group-and-filter via subquery and HAVING for duplicate-detection in SQL for data engineering

"Find employees who share a salary with at least one other employee" is the canonical group-and-filter interview prompt. The mental model: a subquery SELECT salary FROM employees GROUP BY salary HAVING COUNT(*) > 1 produces salary values that appear two or more times; WHERE salary IN (...) keeps employees whose salary is in that set. The same shape powers any "find rows that share a value with at least one other row" question — duplicate emails, duplicate phone numbers, shared store IDs across orders.

Pro tip: Two equivalent forms exist — subquery + IN, or self-join e1.salary = e2.salary AND e1.id <> e2.id. The subquery form is usually cleaner and faster (one GROUP BY + one filter); the self-join form is acceptable but produces extra rows that need DISTINCT. Default to the subquery form unless the interviewer specifically asks for a JOIN.
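The equivalence of the two forms is easy to demonstrate end to end. A sketch in SQLite, using toy data that mirrors the article's employees schema:

```python
# Confirm subquery+IN and self-join+DISTINCT return the same employee ids.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (id INT, name TEXT, salary INT)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(1, "Alice", 50000), (2, "Bob", 60000), (3, "Carol", 60000),
                 (4, "Dave", 70000), (5, "Erin", 80000), (6, "Frank", 80000)])

via_subquery = con.execute("""
    SELECT id FROM employees
    WHERE salary IN (SELECT salary FROM employees
                     GROUP BY salary HAVING COUNT(*) > 1)
    ORDER BY id
""").fetchall()

via_self_join = con.execute("""
    SELECT DISTINCT e1.id
    FROM employees e1
    JOIN employees e2 ON e1.salary = e2.salary AND e1.id <> e2.id
    ORDER BY e1.id
""").fetchall()

print(via_subquery == via_self_join)  # True: ids 2, 3, 5, 6 either way
```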

Subquery: GROUP BY salary HAVING COUNT(*) > 1

The duplicate-salary invariant: SELECT salary FROM employees GROUP BY salary HAVING COUNT(*) > 1 returns the set of salary values shared by two or more employees. HAVING filters groups (post-aggregate); WHERE filters rows (pre-aggregate) — confusing them is the most common mistake in this query family.

  • GROUP BY salary — bucket employees by salary.
  • HAVING COUNT(*) > 1 — keep only groups with two or more employees.
  • Output is a set of salary values — usable as the right side of WHERE salary IN (...).
  • HAVING vs WHERE — HAVING is for predicates over aggregates; WHERE is for predicates over raw rows.

Worked example. Six employees; salaries 50k (×1), 60k (×2), 70k (×1), 80k (×2).

| id | name | salary |
|---|---|---|
| 1 | Alice | 50000 |
| 2 | Bob | 60000 |
| 3 | Carol | 60000 |
| 4 | Dave | 70000 |
| 5 | Erin | 80000 |
| 6 | Frank | 80000 |

GROUP BY salary HAVING COUNT(*) > 1 returns:

| salary |
|---|
| 60000 |
| 80000 |

Worked-example solution.

SELECT salary
FROM employees
GROUP BY salary
HAVING COUNT(*) > 1;

Rule of thumb: "shared by at least N rows" → GROUP BY <field> HAVING COUNT(*) >= N. The subquery returns a list of values; feed it to the outer WHERE.

WHERE salary IN (...) filter to keep matching rows

The outer filter invariant: WHERE salary IN (subquery) returns the original rows whose salary is in the subquery's result set. This brings back full employee rows (not just salary values), which is what the prompt asks for.

  • WHERE salary IN (SELECT ... GROUP BY ... HAVING ...) — semi-join semantics.
  • WHERE EXISTS (SELECT 1 FROM ... WHERE ...) — equivalent in spirit; sometimes faster on indexed columns.
  • Order the output — ORDER BY salary, name for deterministic results.
  • NOT IN caveat — be careful with NULLs in the subquery; use NOT EXISTS for robustness.
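The NOT IN caveat deserves a concrete demo: one NULL in the subquery result makes NOT IN return zero rows, while NOT EXISTS behaves as expected. A minimal sketch with toy tables in SQLite:

```python
# NULL poisons NOT IN: x <> NULL evaluates to unknown, so no row survives.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (x INT)")
con.execute("CREATE TABLE b (x INT)")
con.executemany("INSERT INTO a VALUES (?)", [(1,), (2,), (3,)])
con.executemany("INSERT INTO b VALUES (?)", [(2,), (None,)])

not_in = con.execute(
    "SELECT x FROM a WHERE x NOT IN (SELECT x FROM b)").fetchall()
not_exists = con.execute(
    "SELECT x FROM a WHERE NOT EXISTS "
    "(SELECT 1 FROM b WHERE b.x = a.x)").fetchall()

print(not_in)      # [] — the NULL in b makes every NOT IN test unknown
print(not_exists)  # [(1,), (3,)]
```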

Worked example. Apply the outer filter to the previous data.

| id | name | salary |
|---|---|---|
| 2 | Bob | 60000 |
| 3 | Carol | 60000 |
| 5 | Erin | 80000 |
| 6 | Frank | 80000 |

Worked-example solution.

SELECT id, name, salary
FROM employees
WHERE salary IN (
  SELECT salary
  FROM employees
  GROUP BY salary
  HAVING COUNT(*) > 1
)
ORDER BY salary ASC, name ASC;

Rule of thumb: the subquery defines "values that share"; the outer query brings back the rows.

Self-join alternative: e1.salary = e2.salary AND e1.id <> e2.id

The self-join invariant: alias employees as e1 and e2, join on equal salary and unequal ids; surviving e1 rows have at least one peer. Add SELECT DISTINCT e1.id, e1.name, e1.salary because each e1 row joins to one row per peer.

  • FROM employees e1 JOIN employees e2 ON e1.salary = e2.salary AND e1.id <> e2.id — pair every employee with every same-salary peer.
  • SELECT DISTINCT e1.id — collapse multiple peers per e1 to one row.
  • Performance — usually slower than the subquery form because of the row explosion before DISTINCT.
  • When to use — when you need attributes from the matching peer (e.g. peer's name, hire date).

Worked example. Same input. Self-join trace.

| e1.id | e1.salary | e2.id | e2.salary |
|---|---|---|---|
| 2 | 60000 | 3 | 60000 |
| 3 | 60000 | 2 | 60000 |
| 5 | 80000 | 6 | 80000 |
| 6 | 80000 | 5 | 80000 |

After SELECT DISTINCT: 4 rows ({2, 3, 5, 6}).

Worked-example solution.

SELECT DISTINCT e1.id, e1.name, e1.salary
FROM employees e1
JOIN employees e2
  ON e1.salary = e2.salary AND e1.id <> e2.id
ORDER BY e1.salary ASC, e1.name ASC;

Rule of thumb: default to the subquery form; reach for the self-join only when the interviewer asks "can you do this with a JOIN?" or when the output needs peer attributes.

Common beginner mistakes

  • Using WHERE COUNT(*) > 1 — illegal; aggregates aren't allowed in WHERE. Use HAVING.
  • Self-joining without e1.id <> e2.id — every row matches itself, every row appears.
  • Forgetting DISTINCT on the self-join — duplicate output rows when peers > 1.
  • WHERE salary = (SELECT salary ...) instead of IN — fails when the subquery returns multiple rows.
  • Skipping ORDER BY and getting non-deterministic output across runs.

SQL Interview Question on Employees With the Same Salary

Table employees(id INT, name TEXT, salary NUMERIC). Return all employees whose salary is shared with at least one other employee. Columns: id, name, salary. Sort ascending by salary, then by name.

Solution Using subquery + HAVING COUNT(*) > 1

SELECT id, name, salary
FROM employees
WHERE salary IN (
  SELECT salary
  FROM employees
  GROUP BY salary
  HAVING COUNT(*) > 1
)
ORDER BY salary ASC, name ASC;

Step-by-step trace (input: 7 employees):

| id | name | salary |
|---|---|---|
| 1 | Alice | 50000 |
| 2 | Bob | 60000 |
| 3 | Carol | 60000 |
| 4 | Dave | 70000 |
| 5 | Erin | 80000 |
| 6 | Frank | 80000 |
| 7 | Greg | 80000 |

  1. Inner query — GROUP BY salary HAVING COUNT(*) > 1
    • 50000 (1) — drops.
    • 60000 (2) — keeps.
    • 70000 (1) — drops.
    • 80000 (3) — keeps. Result set: {60000, 80000}.
  2. Outer WHERE salary IN ({60000, 80000}) — keeps rows with salary in that set: ids 2, 3, 5, 6, 7.
  3. ORDER BY salary ASC, name ASC — primary salary ascending: the 60k group first, then the 80k group; secondary name ascending within each group.

Output:

| id | name | salary |
|---|---|---|
| 2 | Bob | 60000 |
| 3 | Carol | 60000 |
| 5 | Erin | 80000 |
| 6 | Frank | 80000 |
| 7 | Greg | 80000 |

Why this works — concept by concept:

  • Inner subquery GROUP BY salary HAVING COUNT(*) > 1 — produces exactly the salary values shared by 2+ employees; HAVING is the right clause because the predicate compares an aggregate.
  • WHERE salary IN (...) — semi-join: keeps original employee rows whose salary appears in the subquery's set; preserves all original columns.
  • No JOIN required — the subquery + IN form avoids the row-explosion + DISTINCT overhead of a self-join.
  • ORDER BY salary ASC, name ASC — deterministic output: groups are listed ascending by salary, names ascending within each group.
  • HAVING COUNT(*) > 1 — strict greater-than-one; equivalent to "at least one peer."
  • Cost — one scan of employees for the inner aggregate, one hash lookup against the result for the outer filter → O(N) time, O(K) space where K is the number of duplicated salaries.



3. SQL Regular Expressions for Numeric Authorization Codes

Anchored regex patterns for numeric-only matching in SQL for data engineering

"Count transactions whose authorization code is numeric only" is the canonical regex-in-SQL interview prompt. The mental model: the regex '^[0-9]+$' matches a string that starts (^), contains one or more digits ([0-9]+), and ends ($) — with no other characters allowed. The Postgres ~ operator applies a regex; LIKE cannot enforce "all digits" cleanly, which is why regex is the right primitive.

Pro tip: Anchors matter. Without ^ and $, the regex [0-9]+ matches any string that contains at least one digit anywhere — so 'abc123' would pass. Always anchor with ^...$ when the prompt says "only" or "exactly." Drop anchors when matching a substring is enough.
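The anchoring rule can be sanity-checked with Python's re module, whose patterns share these semantics with the Postgres ~ operator. A small sketch (sample codes are made up):

```python
# Unanchored search: matches any string containing a digit.
# fullmatch: equivalent to implicit ^...$ anchors — "only digits".
import re

codes = ["12345", "abc123", "1234X", ""]

contains_digit = [c for c in codes if re.search(r"[0-9]+", c)]
digits_only    = [c for c in codes if re.fullmatch(r"[0-9]+", c)]

print(contains_digit)  # ['12345', 'abc123', '1234X']
print(digits_only)     # ['12345']
```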

Pattern ^[0-9]+$ — start/end anchors and digit class

The pattern invariant: ^ anchors to start, $ anchors to end, [0-9] is the digit character class, + is "one or more". Together they require the entire string to be one or more digits with nothing else.

  • ^ — start anchor.
  • $ — end anchor.
  • [0-9] — character class for digits 0–9.
  • + — one or more.
  • \d — synonym for [0-9] in Postgres (with ~ and the right regex flavor); [0-9] is universally portable.

Worked example. Five sample codes.

| auth_code | matches ^[0-9]+$? |
|---|---|
| '12345' | yes |
| '00007' | yes |
| 'A1234' | no (leading letter) |
| '1234X' | no (trailing letter) |
| '' | no (empty; + requires at least one digit) |

Worked-example solution.

SELECT auth_code, (auth_code ~ '^[0-9]+$') AS is_numeric
FROM transactions
ORDER BY auth_code;

Rule of thumb: ^X$ means "exactly X"; X alone means "contains X somewhere." Pick deliberately.

Postgres ~ operator vs LIKE

The operator invariant: ~ is a Postgres regex match; LIKE is a glob-style wildcard match. LIKE cannot enforce "all digits" without exhaustive character-by-character checks; ~ does it in one expression.

  • col ~ 'pattern' — Postgres regex match (case-sensitive).
  • col ~* 'pattern' — case-insensitive variant.
  • col !~ 'pattern' — does not match.
  • col LIKE 'X%' — starts with X; cannot express "only digits."
  • Cross-dialect — Snowflake REGEXP_LIKE(col, pattern), MySQL col REGEXP pattern, BigQuery REGEXP_CONTAINS(col, pattern).

Worked example. Compare ~ and LIKE for "all digits."

| code | approach | works? |
|---|---|---|
| auth_code LIKE '[0-9]%' | wildcard | no — Postgres LIKE doesn't do character classes |
| auth_code SIMILAR TO '[0-9]+' | SQL-standard regex | yes (Postgres; SIMILAR TO is implicitly anchored) |
| auth_code ~ '^[0-9]+$' | Postgres regex | yes |

Worked-example solution.

SELECT
  auth_code,
  (auth_code ~ '^[0-9]+$')          AS via_tilde,
  (auth_code SIMILAR TO '[0-9]+')   AS via_similar_to
FROM transactions
ORDER BY auth_code;

Rule of thumb: default to ~ on Postgres for portability between simple regex and capture-group extraction; SIMILAR TO is rarely used in practice.

Counting matches: COUNT(*) WHERE auth_code ~ '^[0-9]+$'

The aggregation invariant: wrap the regex match in WHERE to filter rows; then COUNT(*) returns the number of qualifying rows. The same pattern counts any "rows whose column matches a regex" question.

  • COUNT(*) WHERE col ~ pattern — count of matching rows.
  • COUNT(DISTINCT col) WHERE col ~ pattern — unique matching values.
  • COUNT(*) FILTER (WHERE col ~ pattern) (Postgres) — same as above, expressed inside the SELECT.
  • SUM(CASE WHEN col ~ pattern THEN 1 ELSE 0 END) — engine-portable equivalent.

Worked example. Count numeric auth codes among 5 rows.

| transaction_id | auth_code |
|---|---|
| 1 | 12345 |
| 2 | A1234 |
| 3 | 99999 |
| 4 | 1234X |
| 5 | 00007 |

Numeric-only count = 3 (rows 1, 3, 5).

Worked-example solution.

SELECT COUNT(*) AS numeric_count
FROM transactions
WHERE auth_code ~ '^[0-9]+$';
-- numeric_count = 3

Rule of thumb: always state your regex in plain English first ("starts, one or more digits, ends") before writing it; interviewers reward explicit explanation.
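To prototype this query locally without a Postgres instance, SQLite's REGEXP operator can be wired to a Python function. A hedged sketch (SQLite has no built-in regex; the toy rows mirror the article's data):

```python
# Registering a function named `regexp` enables SQLite's REGEXP operator:
# X REGEXP Y calls regexp(Y, X), i.e. (pattern, value).
import re
import sqlite3

con = sqlite3.connect(":memory:")
con.create_function(
    "regexp", 2,
    lambda pattern, value: re.search(pattern, value or "") is not None)

con.execute("CREATE TABLE transactions (transaction_id INT, auth_code TEXT)")
con.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(1, "12345"), (2, "A1234"), (3, "99999"),
                 (4, "1234X"), (5, "00007"), (6, "")])

n = con.execute(
    "SELECT COUNT(*) FROM transactions WHERE auth_code REGEXP '^[0-9]+$'"
).fetchone()[0]
print(n)  # 3 — rows 1, 3, 5 survive the anchored pattern
```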

Common beginner mistakes

  • Forgetting the ^ and $ anchors — matches strings with digits anywhere, not "only digits."
  • Using [0-9]* (zero or more) and matching the empty string when the prompt forbids it.
  • LIKE '%[0-9]%' — Postgres LIKE doesn't do character classes; use ~.
  • Mixing engine syntaxes — shorthand classes like \d vary across engines and regex flavors; stick to [0-9] for maximum portability.
  • Missing case-sensitivity — [A-Z] and [a-z] are different; use [A-Za-z] or ~* for combined.

SQL Interview Question on Numeric Authorization Code Transactions

Table transactions(transaction_id INT, auth_code TEXT, amount NUMERIC). Return the count of transactions whose auth_code is numeric only (digits 0–9, at least one digit, nothing else). Column: numeric_count.

Solution Using ~ '^[0-9]+$' regex with COUNT(*)

SELECT COUNT(*) AS numeric_count
FROM transactions
WHERE auth_code ~ '^[0-9]+$';

Step-by-step trace (input: 6 rows):

| transaction_id | auth_code |
|---|---|
| 1 | 12345 |
| 2 | A1234 |
| 3 | 99999 |
| 4 | 1234X |
| 5 | 00007 |
| 6 | (empty string) |

  1. Scan transactions — read all 6 rows.
  2. Apply regex match for each row:
    • '12345' — matches ^[0-9]+$ → keep.
    • 'A1234' — fails (leading A) → drop.
    • '99999' — matches → keep.
    • '1234X' — fails (trailing X) → drop.
    • '00007' — matches → keep.
    • '' — fails (+ requires at least one digit) → drop.
  3. COUNT(*) — 3 rows survive.

Output:

| numeric_count |
|---|
| 3 |

Why this works — concept by concept:

  • ^ start anchor — forces the match to begin at the first character; without it, 'A1234' would match the substring 1234 and incorrectly pass.
  • $ end anchor — forces the match to end at the last character; without it, '1234X' would match the prefix 1234 and incorrectly pass.
  • [0-9]+ character class with quantifier — exactly one or more digit characters; + excludes the empty string.
  • Postgres ~ operator — applies the regex per row; returns boolean usable in WHERE.
  • COUNT(*) — counts the surviving rows after the filter.
  • Cost — one scan of transactions; regex is O(L) per row where L is the auth code length → O(N · L) total time, O(1) space.



4. Python String Manipulation for the Largest Odd-Number Substring

Right-to-left digit-string scans for the longest odd-prefix in Python for data engineering

"Given a digit string, return the largest odd-number substring it contains" is the canonical right-to-left scan interview prompt. The mental model: the largest odd-number substring is the longest prefix of s that ends with an odd digit; scan from the right, find the first odd digit, return s[:i+1]. The same shape powers any "trim trailing X" question — trim trailing zeros, trim trailing whitespace, strip suffix.

Pro tip: "Largest odd number" means the biggest numeric value that is odd. The biggest substring of a digit string ending in an odd digit is the prefix up to and including that odd digit; longer prefixes ending in an even digit are bigger numerically but not odd. The right-to-left scan finds exactly the right cut point in one pass.

Odd-digit predicate: s[i] in '13579'

The parity invariant: a digit is odd iff it's one of 1, 3, 5, 7, 9. The Python idiom s[i] in '13579' tests parity in one expression with no integer parsing.

  • s[i] in '13579' — string-membership test; O(1) against a 5-char string.
  • int(s[i]) % 2 == 1 — parses then mods; works but slower.
  • ord(s[i]) % 2 == 1 — exploits ASCII (digits 0–9 are 48–57); fastest but unidiomatic.
  • s[i] in {'1','3','5','7','9'} — set membership; O(1); arguably cleanest.

Worked example. Test each character of '52468135'.

| i | s[i] | odd? |
|---|---|---|
| 0 | 5 | yes |
| 1 | 2 | no |
| 2 | 4 | no |
| 3 | 6 | no |
| 4 | 8 | no |
| 5 | 1 | yes |
| 6 | 3 | yes |
| 7 | 5 | yes |

Worked-example solution.

s = '52468135'
for i, ch in enumerate(s):
    print(i, ch, ch in '13579')
# 0 5 True
# 1 2 False
# 2 4 False
# 3 6 False
# 4 8 False
# 5 1 True
# 6 3 True
# 7 5 True

Rule of thumb: string-membership is the cleanest parity test for digit characters; reach for it before int().

Right-to-left scan: walk from end, return s[:i+1] on first odd

The scan invariant: start i = len(s) - 1, walk left while s[i] is even; the first odd s[i] is the cut point; return s[:i+1]. Right-to-left because the largest odd substring is the longest prefix ending in an odd digit — and the longest such prefix has the rightmost odd digit as its last character.

  • for i in range(len(s) - 1, -1, -1) — explicit reverse range.
  • if s[i] in '13579': return s[:i+1] — first odd from the right.
  • return '' — fallback when no odd digit exists.
  • Why s[:i+1] and not s[:i] — Python slicing is half-open; s[:i+1] includes index i.

Worked example. Apply the scan to '52468135'.

| i (right-to-left) | s[i] | odd? | action |
|---|---|---|---|
| 7 | 5 | yes | return s[:8] = '52468135' |

The scan finds the odd digit at i=7 immediately; result is the entire string.

Worked-example solution.

def largest_odd(s):
    for i in range(len(s) - 1, -1, -1):
        if s[i] in '13579':
            return s[:i+1]
    return ''

print(largest_odd('52468135'))   # '52468135'
print(largest_odd('5246'))       # '5'   — only first digit is odd
print(largest_odd('2468'))       # ''    — no odd digit

Rule of thumb: "longest prefix ending in X" → right-to-left scan, return s[:i+1] on first match.

Empty-string fallback when no odd digit exists

The fallback invariant: if the loop completes without returning, the input has no odd digits; the answer is the empty string. Be explicit; don't rely on the loop's implicit None return.

  • return '' — explicit empty-string answer.
  • return None — alternative if the prompt says "return None when no odd digit."
  • Document the fallback — comment on the choice; interviewers grade clarity.
  • Edge case: empty input — largest_odd('') should also return ''; the loop doesn't execute and falls through to return ''.

Worked example. Four small inputs, three of them edge cases.

| input | result |
|---|---|
| '' | '' |
| '2468' | '' |
| '4' | '' |
| '5' | '5' |

Worked-example solution.

def largest_odd(s):
    for i in range(len(s) - 1, -1, -1):
        if s[i] in '13579':
            return s[:i+1]
    return ''

assert largest_odd('') == ''
assert largest_odd('2468') == ''
assert largest_odd('4') == ''
assert largest_odd('5') == '5'

Rule of thumb: always include an assert for the empty-input case in your worked example; interviewers grade defensive thinking.

Common beginner mistakes

  • Scanning left-to-right for the first odd digit — finds the leftmost odd, not the rightmost; the longest substring is the rightmost-odd prefix.
  • Using s[:i] instead of s[:i+1] — drops the odd digit itself.
  • Calling int(s[i]) % 2 — parses each character; slower than membership test and unidiomatic.
  • Forgetting the empty-string fallback — function returns None implicitly on no-odd-digit input.
  • Mutating s mid-loop or building the result with += — use slicing for O(K) result construction in one step.
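A good defense against all of these mistakes is a brute-force oracle: enumerate every odd-ending substring, take the maximum by numeric value, and cross-check the scan against it. A sketch (the comparison is on int values because leading zeros make string equality too strict):

```python
def largest_odd(s):
    # Right-to-left scan from the article: longest prefix ending in an odd digit.
    for i in range(len(s) - 1, -1, -1):
        if s[i] in '13579':
            return s[:i + 1]
    return ''

def brute_force(s):
    # Oracle: every substring ending in an odd digit; max by numeric value.
    odds = [s[i:j] for i in range(len(s))
            for j in range(i + 1, len(s) + 1)
            if s[j - 1] in '13579']
    return max(odds, key=int) if odds else ''

for s in ['4206538', '52468135', '2468', '', '5', '105']:
    fast, slow = largest_odd(s), brute_force(s)
    assert (fast == slow == '') or int(fast) == int(slow)
```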

Python Interview Question on the Largest Odd-Number Substring

Given a string s of digit characters, return the largest odd-number substring that can be read from s by trimming trailing characters. If no odd-number substring exists, return the empty string.

def largest_odd_number(s):
    ...

Solution Using a right-to-left scan with s[i] in '13579'

def largest_odd_number(s):
    for i in range(len(s) - 1, -1, -1):
        if s[i] in '13579':
            return s[:i + 1]
    return ''

Step-by-step trace (input: s = '4206538'):

| index | s[i] | odd? | action |
|---|---|---|---|
| 6 | 8 | no | continue |
| 5 | 3 | yes | return s[:6] = '420653' |

  1. Initialize — i = 6 (last index of the length-7 string).
  2. s[6] = '8' — '8' in '13579'? No. Continue; i = 5.
  3. s[5] = '3' — '3' in '13579'? Yes. Return s[:6] = '420653'.

Output:

| return value |
|---|
| '420653' |

Why this works — concept by concept:

  • Right-to-left scan — the largest numeric substring of s ending in an odd digit is the longest prefix that ends with the rightmost odd digit; scanning from the right finds that cut point in one pass.
  • s[i] in '13579' — string-membership is O(1) against a 5-character constant; the cleanest parity test for digit characters.
  • s[:i+1] slicing — Python slicing is half-open, so s[:i+1] includes the odd digit itself (the last character of the result).
  • Early return — once the first odd digit from the right is found, no further scanning is needed.
  • Empty-string fallback — when no odd digit exists, the loop exits naturally and return '' handles the case explicitly without relying on implicit None.
  • No mutation, no allocation — slicing is O(K) for the K-length result; no intermediate buffers.
  • CostO(N) worst-case time when no odd digit exists; O(1) extra space; O(K) for the result string.



5. Python Hash Tables and Sorting for Country-Count Rollups

Counter rollups with deterministic tuple sort keys in Python for data engineering

"Count occurrences per country and return ranked output" is the canonical rollup interview prompt. The mental model: Counter(...) produces a frequency dict in one line; sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])) produces the ranked output sorted by descending count then ascending name on ties. The same shape powers per-key totals, frequency rankings, top-N reports.

Pro tip: Counter is purpose-built for "count occurrences" — don't reinvent it with a defaultdict(int) loop. The tuple sort key (-kv[1], kv[0]) is the canonical "value desc, key asc" pattern; memorize it because it appears in every rollup interview.
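The equivalence is worth seeing once: Counter and the hand-rolled defaultdict(int) loop produce identical counts, and Counter compares equal to the dict. A sketch with a toy country list:

```python
# Counter vs defaultdict(int): same counts, one line vs three.
from collections import Counter, defaultdict

countries = ['US', 'CA', 'US', 'MX', 'CA', 'US']

counts = Counter(countries)

manual = defaultdict(int)
for c in countries:
    manual[c] += 1

assert counts == manual  # Counter is a dict subclass; mapping equality holds

ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)  # [('US', 3), ('CA', 2), ('MX', 1)]
```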

collections.Counter for one-line frequency rollup

The Counter invariant: Counter(iterable) produces a dict-like object mapping each unique element to its count. The construction is one line; reads are O(1); the result is iterable as (key, count) pairs via .items() or .most_common().

  • Counter(['a', 'b', 'a']) — produces Counter({'a': 2, 'b': 1}).
  • counter.items() — (key, count) pairs.
  • counter.most_common(n) — top-N by count, in one call.
  • counter[k] — count for k; returns 0 for missing keys (not KeyError).

Worked example. Six country occurrences across three countries.

countries (input)
['US', 'CA', 'US', 'MX', 'CA', 'US']

After Counter(countries):

country count
US 3
CA 2
MX 1

Worked-example solution.

from collections import Counter

countries = ['US', 'CA', 'US', 'MX', 'CA', 'US']
counts = Counter(countries)
print(counts)
# Counter({'US': 3, 'CA': 2, 'MX': 1})

Rule of thumb: if the input is a flat list of keys to count, Counter is the right primitive — one line, no boilerplate.

Tuple sort key for "value desc, key asc" tie-break

The sort-key invariant: sorted(items, key=lambda kv: (-kv[1], kv[0])) sorts primarily by descending count (negate to flip the natural ascending order) and secondarily by ascending key on ties. The tuple breaks ties deterministically.

  • key=lambda kv: (-kv[1], kv[0]) — value desc, key asc (the default expected tie-break).
  • key=lambda kv: kv[1], reverse=True — same primary order; ties broken arbitrarily.
  • key=lambda kv: (-kv[1], -kv[0]) — value desc, key desc.
  • heapq.nlargest(n, items, key=lambda kv: kv[1]) — top-N by value when n << len(items).

Worked example. Two countries tied at 2; tie-break by name ascending (note the input lists MX before CA, but the sorted output puts CA first).

country count
US 3
MX 2
CA 2

After sorted(..., key=lambda kv: (-kv[1], kv[0])):

country count
US 3
CA 2
MX 2

Worked-example solution.

from collections import Counter

counts = Counter({'US': 3, 'MX': 2, 'CA': 2})
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
# [('US', 3), ('CA', 2), ('MX', 2)]

Rule of thumb: always pair a numeric desc sort with a name asc tie-break — interviewers reward the explicit secondary key, even when ties are unlikely.

Multi-field sort and top-N slicing

The composition invariant: after sorting, slice with [:N] for top-N; or use Counter.most_common(N) to do both in one call. The most_common method is the cleanest form when you don't need a custom tie-break.

  • sorted(...)[:N] — top-N after a custom sort.
  • Counter.most_common(N) — top-N built in (default tie-break is insertion order, not ascending key).
  • Multi-field input — Counter accepts tuples; Counter([(country, region) for ...]) rolls up per (country, region).
  • Convert to dict — dict(Counter(...)) for JSON serialization.

Worked example. Top-2 countries.

ranked: [('US', 3), ('CA', 2), ('MX', 2)]
top-2: [('US', 3), ('CA', 2)]

Worked-example solution.

from collections import Counter

counts = Counter({'US': 3, 'CA': 2, 'MX': 2})

# Custom tie-break + slice
top2_custom = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]
# [('US', 3), ('CA', 2)]

# Built-in most_common (default insertion-order tie-break)
top2_builtin = counts.most_common(2)
# [('US', 3), ('CA', 2)]

Rule of thumb: use most_common(N) when the prompt doesn't specify a tie-break; switch to a custom sorted(...) with a tuple key when it does.

Common beginner mistakes

  • Building a defaultdict(int) loop when Counter does it in one line.
  • Forgetting the tie-break and getting non-deterministic output across runs.
  • Using key=lambda kv: -kv[1] on string values — TypeError on negation; use reverse=True for strings.
  • Converting Counter to list of tuples manually — use .items().
  • Sorting then taking the wrong end with [-N:] — after a descending sort, [:N] is the top-N; [-N:] is the bottom-N.
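The string-negation pitfall from the list above is worth seeing once (toy data — the tuple values here are strings, not counts):

```python
pairs = [('US', 'c'), ('CA', 'a'), ('MX', 'b')]

# sorted(pairs, key=lambda kv: -kv[1])  # TypeError: unary minus on a str
ranked = sorted(pairs, key=lambda kv: kv[1], reverse=True)  # works for strings
print(ranked)
# [('US', 'c'), ('MX', 'b'), ('CA', 'a')]
```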

Python Interview Question on Country-Count Rollups

You are given a list of country codes (strings). Return a list of (country, count) tuples sorted by count descending, then by country ascending on ties.

def count_countries(countries):
    ...

Solution Using Counter and a tuple sort key

from collections import Counter

def count_countries(countries):
    counts = Counter(countries)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

Step-by-step trace (input: countries = ['US', 'CA', 'US', 'MX', 'CA', 'US', 'BR', 'CA']):

  1. Counter(countries) — single pass through the list, increment per element.
    • After: Counter({'US': 3, 'CA': 3, 'MX': 1, 'BR': 1}) (repr orders by count, insertion order on ties).
  2. counts.items() — pairs: [('US', 3), ('CA', 3), ('MX', 1), ('BR', 1)] (insertion order; Python preserves it from 3.7+).
  3. sorted(..., key=lambda kv: (-kv[1], kv[0]))
    • Primary key: -count → US (-3), CA (-3), MX (-1), BR (-1). After sort by primary: ties at -3 and ties at -1.
    • Secondary key: country kv[0] ascending — at -3: CA before US; at -1: BR before MX.
    • Final order: [('CA', 3), ('US', 3), ('BR', 1), ('MX', 1)].

Output:

country count
CA 3
US 3
BR 1
MX 1

Why this works — concept by concept:

  • Counter(iterable) — one-line frequency rollup; iterates the input once and maintains a dict[str, int] of counts.
  • .items() — exposes (key, count) pairs as the input to sorted; preserves insertion order, but we'll re-sort.
  • -kv[1] primary key — negating the count flips ascending sort into descending; standard Python idiom for "biggest first."
  • kv[0] secondary key — country name ascending breaks ties deterministically.
  • sorted(...) — returns a new list of tuples; doesn't mutate the input.
  • No reverse=True — using a tuple with negated primary lets ascending and descending live together cleanly; reverse=True would flip both keys, which is wrong.
  • Cost — Counter is O(N); sorted is O(K log K) where K is the number of distinct countries → O(N + K log K) total.

PYTHON
Topic — hash table
Hash-table problems

Practice →

PYTHON
Topic — sorting
Sorting problems

Practice →


Tips to Crack Intuit Data Engineering Interviews

The loop is mixed SQL + Python — drill both

The curated Intuit practice set is 3 SQL + 2 Python. Skipping either language to focus on the other is a misallocation; you can fail the loop on a single Python or single SQL round. The Intuit SQL practice page and the Intuit Python practice page are the right starting surfaces.

Window-function-over-aggregate is the medium-tier ceiling

#561 Rank Transaction Types by Average Expense is the only MEDIUM in the set, and the trick is recognizing that the window function runs after the GROUP BY aggregate. Without that mental model, candidates try to wrap a CTE around the aggregate (still works, just verbose) or — worse — try to apply RANK() to raw transaction rows. Drill the window-functions practice page until "RANK over AVG" is one fluid query.

Regex shows up twice — once in SQL, once in Python

#563 Numeric Authorization Code Transactions uses Postgres ~ '^[0-9]+$'; #564 Largest Odd Number in String can be solved with Python re.findall(r'[13579]+', s) (though the right-to-left scan covered above is cleaner). Memorize the regex flavors per engine: Postgres ~, Snowflake REGEXP_LIKE, MySQL REGEXP, BigQuery REGEXP_CONTAINS. Drill the regex practice page for both languages.
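For #564 specifically, one regex formulation (a variant of the findall idea, sketched here rather than the canonical solution) leans on greedy backtracking: `[0-9]*` consumes the whole string, then gives characters back until `[13579]` can match the rightmost odd digit.

```python
import re

def largest_odd_number(s):
    # Greedy [0-9]* backtracks until [13579] matches the rightmost
    # odd digit, so the match is exactly the answer prefix.
    m = re.match(r'[0-9]*[13579]', s)
    return m.group() if m else ''
```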

Easy-tier discipline matters

Four of five Intuit problems are EASY-tier. Easy at Intuit means the interviewer expects clean, idiomatic code with no off-by-ones, every aggregate paired with the right GROUP BY or Counter, every regex anchored. Stuttering through an EASY-tier problem signals worse than failing a MEDIUM-tier one. The Intuit easy practice page is the right warmup.

Where to practice on PipeCode

Start with the Intuit practice page for the curated 5-problem set. Then drill the matching topic pages: window-functions, joins, subquery, regex, string-manipulation, hash-table, sorting. The interview courses page bundles structured SQL and Python curricula. For broad coverage, browse by topic.

Communication and approach under time pressure

Talk through the structure before writing code: "I'll group by transaction_type, compute AVG, then RANK over the average." Interviewers grade process as much as the final answer. Leave 5 minutes at the end of each problem for an edge-case sweep: empty input, ties, NULL columns, no odd digits, regex against empty strings, single-element partitions. The most common "almost passed" failure mode is correct happy-path code that crashes on empty input — a 30-second sweep prevents it.
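Concretely, the 30-second edge-case sweep on the Counter rollup from section 5 is a handful of assertions (reusing the count_countries solution from that section):

```python
from collections import Counter

def count_countries(countries):
    counts = Counter(countries)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Sweep: empty input, all-tied counts, single element
assert count_countries([]) == []
assert count_countries(['MX', 'CA']) == [('CA', 1), ('MX', 1)]
assert count_countries(['US']) == [('US', 1)]
print('edge cases pass')
```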


Frequently Asked Questions

What is the Intuit data engineering interview process like?

Intuit's data engineering interview opens with a recruiter screen, then a technical phone screen with one SQL or Python coding problem, then an onsite (or virtual onsite) of four to five rounds: two coding rounds split between SQL (window functions over aggregates, JOIN + subquery, regex) and Python (string manipulation, Counter rollups), one data-modeling or system-design discussion (transactions, expenses, finance dimensional models), and one to two behavioral rounds. The coding rounds are heavy on the patterns covered in this guide.

What SQL topics does Intuit test for data engineers?

Intuit SQL interviews concentrate on three primitives that map directly to the curated set: window functions over aggregates (#561 — RANK() OVER (ORDER BY AVG(amount) DESC)), JOINs and subqueries with HAVING COUNT(*) > 1 for duplicate-detection (#562), and Postgres-flavored regex ~ '^[0-9]+$' for pattern matching (#563). Memorize these three patterns and you will recognize most Intuit SQL prompts within the first 60 seconds.

What Python topics does Intuit test for data engineers?

Intuit Python interviews concentrate on two primitives: string-manipulation right-to-left scans (#564 — find the rightmost odd digit; return the prefix) and collections.Counter rollups with tuple sort keys (#565 — count occurrences, sort by count desc / key asc). Both problems are EASY-tier; interviewers expect clean idiomatic Python with no off-by-ones.

How hard are Intuit data engineering interview questions?

The Intuit set is 4 easy + 1 medium + 0 hard. The MEDIUM is #561 (window function over aggregate) — the only problem that requires layering two SQL constructs. The four EASY problems test fundamentals: subquery + IN, regex, string scan, Counter. Stuttering on any EASY problem signals lack of fluency; correctness alone isn't enough.

Are window functions and regex common in Intuit interviews?

Yes. #561 explicitly tests RANK() over a GROUP BY aggregate. Regex appears in #563 (SQL ~ '^[0-9]+$') and as a viable alternative for #564 (Python re.findall(r'[13579]+', s)). Drill the window-functions practice page and the regex practice page.

How many Intuit practice problems should I solve before the interview?

Solve all 5 problems on the Intuit practice page end-to-end — untimed first, then timed at 25 minutes per problem. After that, broaden to 30 to 40 additional problems spread across the matching topic pages: window-functions, joins, subquery, regex, string-manipulation, hash-table, sorting. The Intuit SQL practice page, Intuit Python practice page, and Intuit easy practice page are the right surfaces for the final week of prep.


Start practicing Intuit data engineering problems
