DEV Community

Gowtham Potureddi

Intuit Data Engineering Interview Questions

Intuit data engineering interview questions balance SQL and Python with a fintech analytics edge: five primitives spanning both languages — RANK() OVER (ORDER BY AVG(amount) DESC) for window-function ranking over aggregates, JOIN + subquery salary IN (SELECT salary FROM employees GROUP BY salary HAVING COUNT(*) > 1) for same-salary employees, SQL regex auth_code ~ '^[0-9]+$' for numeric-only authorization codes, Python right-to-left scan for the largest odd-number substring, and collections.Counter plus a tuple sort key for country-count rollups. The framings are finance and SaaS — transactions, expenses, salaries, authorization codes, country counts — exactly the data shapes a TurboTax / QuickBooks / Mint engineer ships every week.

This guide walks through the five topic clusters Intuit actually tests, each with a detailed topic explanation, per-sub-topic explanation with a worked example and its solution, and an interview-style problem with a full solution that explains why it works. The mix matches the curated 5-problem Intuit set (4 easy, 1 medium, no hard) — a fundamentals-and-classics hub that rewards window-function fluency, regex pattern recognition, and clean Python rollup primitives.



Top Intuit data engineering interview topics

From the Intuit data engineering practice set, the five numbered sections below follow this topic map (one row per H2):

| # | Topic (sections 1–5) | Why it shows up at Intuit |
|---|---|---|
| 1 | SQL window functions for ranking transaction types by average expense | Rank Transaction Types by Average Expense (MEDIUM) — RANK() OVER (ORDER BY AVG(amount) DESC) over a GROUP BY transaction_type aggregate. |
| 2 | SQL JOINs and subqueries for employees with the same salary | Employees With Same Salary — WHERE salary IN (SELECT salary FROM employees GROUP BY salary HAVING COUNT(*) > 1). |
| 3 | SQL regular expressions for numeric authorization codes | Numeric Authorization Code Transactions — WHERE auth_code ~ '^[0-9]+$'. |
| 4 | Python string manipulation for the largest odd-number substring | Largest Odd Number in String — right-to-left scan for the first odd digit; return the prefix s[:i+1]. |
| 5 | Python hash tables and sorting for country-count rollups | Count Countries — Counter(...) plus sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])). |

Fintech-analytics framing rule: Intuit's prompts span finance-domain SQL and Python — transaction-type expense rankings, salary integrity checks, authorization-code validation, digit-string parsing, country rollups. The interviewer is grading whether you map each business framing to the right primitive: ranked aggregates → window function over GROUP BY; integrity check → subquery + HAVING COUNT > 1; numeric-only → regex ^[0-9]+$; longest-odd-prefix → right-to-left scan; per-key counts → Counter. State the mapping out loud.


1. SQL Window Functions for Ranking Transaction Types by Average Expense

Window functions over aggregates for ranked rollups in SQL for data engineering

"Rank transaction types by their average expense" is the canonical window-function-over-aggregate interview prompt at Intuit. The mental model: GROUP BY transaction_type collapses to one row per type with AVG(amount); RANK() OVER (ORDER BY AVG(amount) DESC) then ranks those rows. The window function operates on the aggregated result set, not the raw rows — that's the structural distinction every candidate must articulate. The same shape powers any "rank entities by aggregated metric" query — top customers by revenue, top categories by margin, top regions by churn.

Pro tip: RANK, DENSE_RANK, and ROW_NUMBER all rank — but they differ on ties. Memorize the difference: RANK skips numbers (1, 1, 3); DENSE_RANK doesn't (1, 1, 2); ROW_NUMBER arbitrarily breaks ties (1, 2, 3 even on duplicates). For "rank with gaps," use RANK; for "rank without gaps," use DENSE_RANK; for "deterministic sequential," use ROW_NUMBER with a secondary ORDER BY tie-break.
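The tie behavior above is easy to verify locally. A minimal sketch using SQLite's window functions (SQLite 3.25+; the scores table and values are made up for illustration):

```python
# Verify RANK vs DENSE_RANK vs ROW_NUMBER tie behavior in SQLite.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (name TEXT, pts INT)")
con.executemany("INSERT INTO scores VALUES (?, ?)",
                [("a", 90), ("b", 90), ("c", 80)])

rows = con.execute("""
    SELECT name,
           RANK()       OVER (ORDER BY pts DESC) AS rk,
           DENSE_RANK() OVER (ORDER BY pts DESC) AS drk,
           ROW_NUMBER() OVER (ORDER BY pts DESC, name ASC) AS rn
    FROM scores
    ORDER BY rn
""").fetchall()
print(rows)  # [('a', 1, 1, 1), ('b', 1, 1, 2), ('c', 3, 2, 3)]
```

The two tied rows share rank 1; RANK jumps to 3 while DENSE_RANK continues at 2, and ROW_NUMBER's name tie-break makes the sequence deterministic.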

GROUP BY transaction_type + AVG(amount) for per-type average

The aggregate invariant: GROUP BY transaction_type produces one row per type; AVG(amount) computes the mean within each group. The output is the canonical per-entity summary — the input to the window function in the next step.

  • GROUP BY transaction_type — one row per type.
  • AVG(amount) — mean amount in the group.
  • COUNT(*) — group size; useful for filtering thin groups via HAVING COUNT(*) >= N.
  • SUM(amount) — total instead of average; pick the metric the prompt asks for.

Worked example. Six transactions across three types.

| transaction_type | amount |
|---|---|
| groceries | 50 |
| groceries | 30 |
| rent | 1200 |
| rent | 1200 |
| dining | 80 |
| dining | 60 |

After GROUP BY transaction_type, AVG(amount):

| transaction_type | avg_amount |
|---|---|
| rent | 1200.00 |
| dining | 70.00 |
| groceries | 40.00 |

Worked-example solution.

SELECT transaction_type, AVG(amount) AS avg_amount
FROM transactions
GROUP BY transaction_type
ORDER BY avg_amount DESC;

Rule of thumb: pinning the aggregate step in a CTE before applying the window function makes the two-stage structure explicit to the interviewer, and it becomes mandatory the moment you need to filter on the rank, since window functions can't appear in WHERE.

RANK() OVER (ORDER BY AVG(amount) DESC) for windowed ranking

The window-rank invariant: RANK() OVER (ORDER BY AVG(amount) DESC) runs over the aggregated rows; it does not need a PARTITION BY if the entire result is one ranked list. The window function evaluates after the GROUP BY and produces one rank value per row in the aggregated output.

  • RANK() OVER (ORDER BY AVG(amount) DESC) — descending rank.
  • No PARTITION BY — single global ranking across all transaction types.
  • PARTITION BY region — one ranking per region (different question).
  • Tie-break inside ORDER BY — ORDER BY AVG(amount) DESC, transaction_type ASC for deterministic output on equal averages.

Worked example. Apply RANK() to the previous aggregate.

| transaction_type | avg_amount | rank |
|---|---|---|
| rent | 1200.00 | 1 |
| dining | 70.00 | 2 |
| groceries | 40.00 | 3 |

Worked-example solution.

SELECT
  transaction_type,
  AVG(amount) AS avg_amount,
  RANK() OVER (ORDER BY AVG(amount) DESC) AS rk
FROM transactions
GROUP BY transaction_type
ORDER BY rk;

Rule of thumb: a window function inside the same SELECT as a GROUP BY runs after the aggregation — you can reference AVG(amount) inside the window's ORDER BY without a CTE wrapper.

Tied averages: RANK skips, DENSE_RANK doesn't

The tie invariant: two types with equal averages share a rank; RANK skips the next number, DENSE_RANK doesn't. For "rank with gaps" (1, 1, 3) use RANK; for "rank without gaps" (1, 1, 2) use DENSE_RANK. The difference matters for paginated leaderboards and any consumer that expects contiguous rank numbers.

  • RANK() — ties share, then skip.
  • DENSE_RANK() — ties share, no skip.
  • ROW_NUMBER() — every row gets a unique rank, ties broken arbitrarily.
  • Always add a tie-break in ORDER BY to make ROW_NUMBER deterministic.

Worked example. Two types tie at 70.00.

| type | avg | RANK | DENSE_RANK | ROW_NUMBER |
|---|---|---|---|---|
| rent | 1200 | 1 | 1 | 1 |
| dining | 70 | 2 | 2 | 2 |
| takeout | 70 | 2 | 2 | 3 |
| groceries | 40 | 4 | 3 | 4 |

Worked-example solution.

SELECT
  transaction_type,
  AVG(amount) AS avg_amount,
  RANK()       OVER (ORDER BY AVG(amount) DESC) AS rk,
  DENSE_RANK() OVER (ORDER BY AVG(amount) DESC) AS drk,
  ROW_NUMBER() OVER (ORDER BY AVG(amount) DESC, transaction_type ASC) AS rn
FROM transactions
GROUP BY transaction_type
ORDER BY rk;

Rule of thumb: RANK is the default expected answer when the prompt says "rank" without qualification; mention DENSE_RANK and ROW_NUMBER as alternatives during your verbal walkthrough.

Common beginner mistakes

  • Trying WHERE RANK() OVER (...) = 1 — illegal; window functions can't appear in WHERE. Wrap in CTE first.
  • Forgetting DESC and getting ascending rank by default.
  • Confusing per-row rank with per-aggregate rank — clarify which one the prompt wants.
  • Skipping the tie-break and surprising the test harness with non-deterministic output.
  • Aggregating without GROUP BY and seeing only one row — every column not behind an aggregate must be in GROUP BY.
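The first mistake above has a standard fix: wrap the ranked query in a CTE, then filter in the outer SELECT. A runnable sketch in SQLite (the transactions table mirrors the article's schema; the rows are made up):

```python
# Window functions are illegal in WHERE; a CTE makes "keep rank 1" legal.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (transaction_type TEXT, amount REAL)")
con.executemany("INSERT INTO transactions VALUES (?, ?)",
                [("groceries", 50), ("groceries", 30),
                 ("rent", 1200), ("rent", 1200),
                 ("dining", 80), ("dining", 60)])

top = con.execute("""
    WITH ranked AS (
        SELECT transaction_type,
               AVG(amount) AS avg_amount,
               RANK() OVER (ORDER BY AVG(amount) DESC) AS rk
        FROM transactions
        GROUP BY transaction_type
    )
    SELECT transaction_type, avg_amount, rk
    FROM ranked
    WHERE rk = 1   -- legal here; illegal inside the CTE's own WHERE
""").fetchall()
print(top)  # [('rent', 1200.0, 1)]
```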

SQL Interview Question on Ranking Transaction Types by Average Expense

Table transactions(transaction_id INT, transaction_type TEXT, amount NUMERIC). Return one row per transaction_type with the average amount and the rank by that average descending. Break ties by alphabetical type ascending. Columns: transaction_type, avg_amount, rk.

Solution Using a window function over a GROUP BY aggregate

SELECT
  transaction_type,
  ROUND(AVG(amount), 2) AS avg_amount,
  RANK() OVER (ORDER BY AVG(amount) DESC) AS rk
FROM transactions
GROUP BY transaction_type
ORDER BY rk, transaction_type ASC;

Step-by-step trace (input: 8 rows across 3 types):

| transaction_id | transaction_type | amount |
|---|---|---|
| 1 | groceries | 50 |
| 2 | groceries | 30 |
| 3 | groceries | 40 |
| 4 | rent | 1200 |
| 5 | rent | 1200 |
| 6 | dining | 80 |
| 7 | dining | 60 |
| 8 | dining | 70 |

  1. Scan + group — GROUP BY transaction_type produces three groups: groceries (3 rows), rent (2 rows), dining (3 rows).
  2. Compute AVG(amount)
    • groceries: (50+30+40)/3 = 40.00
    • rent: (1200+1200)/2 = 1200.00
    • dining: (80+60+70)/3 = 70.00
  3. Apply RANK() OVER (ORDER BY AVG(amount) DESC)
    • rent (1200) → rk=1
    • dining (70) → rk=2
    • groceries (40) → rk=3
  4. ORDER BY rk, transaction_type ASC — primary by rank, secondary alphabetic on ties.

Output:

| transaction_type | avg_amount | rk |
|---|---|---|
| rent | 1200.00 | 1 |
| dining | 70.00 | 2 |
| groceries | 40.00 | 3 |

Why this works — concept by concept:

  • GROUP BY transaction_type — collapses raw transactions into one row per type, the canonical input shape for the window function.
  • AVG(amount) — the per-type metric; ROUND(..., 2) clamps the float for readable output.
  • RANK() OVER (ORDER BY AVG(amount) DESC) — runs after the GROUP BY; assigns rank 1 to the highest average, with ties sharing a rank and the next number skipped.
  • No PARTITION BY — there's a single global ranking across all transaction types; partitioning would create per-partition rankings.
  • Secondary tie-break in ORDER BY — transaction_type ASC makes the final output deterministic when two types share a rank.
  • Cost — one scan of transactions, one hash aggregate, one window pass over the aggregated rows → O(N) time, O(K) space where K is the number of distinct types.



2. SQL JOINs and Subqueries for Employees With the Same Salary

Group-and-filter via subquery and HAVING for duplicate-detection in SQL for data engineering

"Find employees who share a salary with at least one other employee" is the canonical group-and-filter interview prompt. The mental model: a subquery SELECT salary FROM employees GROUP BY salary HAVING COUNT(*) > 1 produces salary values that appear two or more times; WHERE salary IN (...) keeps employees whose salary is in that set. The same shape powers any "find rows that share a value with at least one other row" question — duplicate emails, duplicate phone numbers, shared store IDs across orders.

Pro tip: Two equivalent forms exist — subquery + IN, or self-join e1.salary = e2.salary AND e1.id <> e2.id. The subquery form is usually cleaner and faster (one GROUP BY + one filter); the self-join form is acceptable but produces extra rows that need DISTINCT. Default to the subquery form unless the interviewer specifically asks for a JOIN.
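The equivalence of the two forms is easy to demonstrate end to end. A sketch in SQLite, using toy data that mirrors the article's employees schema:

```python
# Confirm subquery+IN and self-join+DISTINCT return the same employee ids.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (id INT, name TEXT, salary INT)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(1, "Alice", 50000), (2, "Bob", 60000), (3, "Carol", 60000),
                 (4, "Dave", 70000), (5, "Erin", 80000), (6, "Frank", 80000)])

via_subquery = con.execute("""
    SELECT id FROM employees
    WHERE salary IN (SELECT salary FROM employees
                     GROUP BY salary HAVING COUNT(*) > 1)
    ORDER BY id
""").fetchall()

via_self_join = con.execute("""
    SELECT DISTINCT e1.id
    FROM employees e1
    JOIN employees e2 ON e1.salary = e2.salary AND e1.id <> e2.id
    ORDER BY e1.id
""").fetchall()

print(via_subquery == via_self_join)  # True: ids 2, 3, 5, 6 either way
```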

Subquery: GROUP BY salary HAVING COUNT(*) > 1

The duplicate-salary invariant: SELECT salary FROM employees GROUP BY salary HAVING COUNT(*) > 1 returns the set of salary values shared by two or more employees. HAVING filters groups (post-aggregate); WHERE filters rows (pre-aggregate) — confusing them is the most common mistake in this query family.

  • GROUP BY salary — bucket employees by salary.
  • HAVING COUNT(*) > 1 — keep only groups with two or more employees.
  • Output is a set of salary values — usable as the right side of WHERE salary IN (...).
  • HAVING vs WHERE — HAVING is for predicates over aggregates; WHERE is for predicates over raw rows.

Worked example. Six employees; salaries 50k (×1), 60k (×2), 70k (×1), 80k (×2).

| id | name | salary |
|---|---|---|
| 1 | Alice | 50000 |
| 2 | Bob | 60000 |
| 3 | Carol | 60000 |
| 4 | Dave | 70000 |
| 5 | Erin | 80000 |
| 6 | Frank | 80000 |

GROUP BY salary HAVING COUNT(*) > 1 returns:

| salary |
|---|
| 60000 |
| 80000 |

Worked-example solution.

SELECT salary
FROM employees
GROUP BY salary
HAVING COUNT(*) > 1;

Rule of thumb: "shared by at least N rows" → GROUP BY <field> HAVING COUNT(*) >= N. The subquery returns a list of values; feed it to the outer WHERE.

WHERE salary IN (...) filter to keep matching rows

The outer filter invariant: WHERE salary IN (subquery) returns the original rows whose salary is in the subquery's result set. This brings back full employee rows (not just salary values), which is what the prompt asks for.

  • WHERE salary IN (SELECT ... GROUP BY ... HAVING ...) — semi-join semantics.
  • WHERE EXISTS (SELECT 1 FROM ... WHERE ...) — equivalent in spirit; sometimes faster on indexed columns.
  • Order the output — ORDER BY salary, name for deterministic results.
  • NOT IN caveat — be careful with NULLs in the subquery; use NOT EXISTS for robustness.
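The NOT IN caveat deserves a concrete demo: one NULL in the subquery result makes NOT IN return zero rows, while NOT EXISTS behaves as expected. A minimal sketch with toy tables in SQLite:

```python
# NULL poisons NOT IN: x <> NULL evaluates to unknown, so no row survives.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE a (x INT)")
con.execute("CREATE TABLE b (x INT)")
con.executemany("INSERT INTO a VALUES (?)", [(1,), (2,), (3,)])
con.executemany("INSERT INTO b VALUES (?)", [(2,), (None,)])

not_in = con.execute(
    "SELECT x FROM a WHERE x NOT IN (SELECT x FROM b)").fetchall()
not_exists = con.execute(
    "SELECT x FROM a WHERE NOT EXISTS "
    "(SELECT 1 FROM b WHERE b.x = a.x)").fetchall()

print(not_in)      # [] — the NULL in b makes every NOT IN test unknown
print(not_exists)  # [(1,), (3,)]
```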

Worked example. Apply the outer filter to the previous data.

| id | name | salary |
|---|---|---|
| 2 | Bob | 60000 |
| 3 | Carol | 60000 |
| 5 | Erin | 80000 |
| 6 | Frank | 80000 |

Worked-example solution.

SELECT id, name, salary
FROM employees
WHERE salary IN (
  SELECT salary
  FROM employees
  GROUP BY salary
  HAVING COUNT(*) > 1
)
ORDER BY salary ASC, name ASC;

Rule of thumb: the subquery defines "values that share"; the outer query brings back the rows.

Self-join alternative: e1.salary = e2.salary AND e1.id <> e2.id

The self-join invariant: alias employees as e1 and e2, join on equal salary and unequal ids; surviving e1 rows have at least one peer. Add SELECT DISTINCT e1.id, e1.name, e1.salary because each e1 row joins to one row per peer.

  • FROM employees e1 JOIN employees e2 ON e1.salary = e2.salary AND e1.id <> e2.id — pair every employee with every same-salary peer.
  • SELECT DISTINCT e1.id — collapse multiple peers per e1 to one row.
  • Performance — usually slower than the subquery form because of the row explosion before DISTINCT.
  • When to use — when you need attributes from the matching peer (e.g. peer's name, hire date).

Worked example. Same input. Self-join trace.

| e1.id | e1.salary | e2.id | e2.salary |
|---|---|---|---|
| 2 | 60000 | 3 | 60000 |
| 3 | 60000 | 2 | 60000 |
| 5 | 80000 | 6 | 80000 |
| 6 | 80000 | 5 | 80000 |

After SELECT DISTINCT: 4 rows ({2, 3, 5, 6}).

Worked-example solution.

SELECT DISTINCT e1.id, e1.name, e1.salary
FROM employees e1
JOIN employees e2
  ON e1.salary = e2.salary AND e1.id <> e2.id
ORDER BY e1.salary ASC, e1.name ASC;

Rule of thumb: default to the subquery form; reach for the self-join only when the interviewer asks "can you do this with a JOIN?" or when the output needs peer attributes.

Common beginner mistakes

  • Using WHERE COUNT(*) > 1 — illegal; aggregates aren't allowed in WHERE. Use HAVING.
  • Self-joining without e1.id <> e2.id — every row matches itself, every row appears.
  • Forgetting DISTINCT on the self-join — duplicate output rows when peers > 1.
  • WHERE salary = (SELECT salary ...) instead of IN — fails when the subquery returns multiple rows.
  • Skipping ORDER BY and getting non-deterministic output across runs.

SQL Interview Question on Employees With the Same Salary

Table employees(id INT, name TEXT, salary NUMERIC). Return all employees whose salary is shared with at least one other employee. Columns: id, name, salary. Sort ascending by salary, then by name.

Solution Using subquery + HAVING COUNT(*) > 1

SELECT id, name, salary
FROM employees
WHERE salary IN (
  SELECT salary
  FROM employees
  GROUP BY salary
  HAVING COUNT(*) > 1
)
ORDER BY salary ASC, name ASC;

Step-by-step trace (input: 7 employees):

| id | name | salary |
|---|---|---|
| 1 | Alice | 50000 |
| 2 | Bob | 60000 |
| 3 | Carol | 60000 |
| 4 | Dave | 70000 |
| 5 | Erin | 80000 |
| 6 | Frank | 80000 |
| 7 | Greg | 80000 |

  1. Inner query — GROUP BY salary HAVING COUNT(*) > 1
    • 50000 (1) — drops.
    • 60000 (2) — keeps.
    • 70000 (1) — drops.
    • 80000 (3) — keeps. Result set: {60000, 80000}.
  2. Outer WHERE salary IN ({60000, 80000}) — keeps rows with salary in that set: ids 2, 3, 5, 6, 7.
  3. ORDER BY salary ASC, name ASC — primary salary ascending: the 60k group first, then the 80k group; secondary name ascending within each group.

Output:

| id | name | salary |
|---|---|---|
| 2 | Bob | 60000 |
| 3 | Carol | 60000 |
| 5 | Erin | 80000 |
| 6 | Frank | 80000 |
| 7 | Greg | 80000 |

Why this works — concept by concept:

  • Inner subquery GROUP BY salary HAVING COUNT(*) > 1 — produces exactly the salary values shared by 2+ employees; HAVING is the right clause because the predicate compares an aggregate.
  • WHERE salary IN (...) — semi-join: keeps original employee rows whose salary appears in the subquery's set; preserves all original columns.
  • No JOIN required — the subquery + IN form avoids the row-explosion + DISTINCT overhead of a self-join.
  • ORDER BY salary ASC, name ASC — deterministic output: groups are listed ascending by salary, names ascending within each group.
  • HAVING COUNT(*) > 1 — strict greater-than-one; equivalent to "at least one peer."
  • Cost — one scan of employees for the inner aggregate, one hash lookup against the result for the outer filter → O(N) time, O(K) space where K is the number of duplicated salaries.



3. SQL Regular Expressions for Numeric Authorization Codes

Anchored regex patterns for numeric-only matching in SQL for data engineering

"Count transactions whose authorization code is numeric only" is the canonical regex-in-SQL interview prompt. The mental model: the regex '^[0-9]+$' matches a string that starts (^), contains one or more digits ([0-9]+), and ends ($) — with no other characters allowed. The Postgres ~ operator applies a regex; LIKE cannot enforce "all digits" cleanly, which is why regex is the right primitive.

Pro tip: Anchors matter. Without ^ and $, the regex [0-9]+ matches any string that contains at least one digit anywhere — so 'abc123' would pass. Always anchor with ^...$ when the prompt says "only" or "exactly." Drop anchors when matching a substring is enough.
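The anchoring rule can be sanity-checked with Python's re module, whose patterns share these semantics with the Postgres ~ operator. A small sketch (sample codes are made up):

```python
# Unanchored search: matches any string containing a digit.
# fullmatch: equivalent to implicit ^...$ anchors — "only digits".
import re

codes = ["12345", "abc123", "1234X", ""]

contains_digit = [c for c in codes if re.search(r"[0-9]+", c)]
digits_only    = [c for c in codes if re.fullmatch(r"[0-9]+", c)]

print(contains_digit)  # ['12345', 'abc123', '1234X']
print(digits_only)     # ['12345']
```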

Pattern ^[0-9]+$ — start/end anchors and digit class

The pattern invariant: ^ anchors to start, $ anchors to end, [0-9] is the digit character class, + is "one or more". Together they require the entire string to be one or more digits with nothing else.

  • ^ — start anchor.
  • $ — end anchor.
  • [0-9] — character class for digits 0–9.
  • + — one or more.
  • \d — synonym for [0-9] in Postgres (with ~ and the right regex flavor); [0-9] is universally portable.

Worked example. Five sample codes.

| auth_code | matches ^[0-9]+$? |
|---|---|
| '12345' | yes |
| '00007' | yes |
| 'A1234' | no (leading letter) |
| '1234X' | no (trailing letter) |
| '' | no (empty; + requires at least one digit) |

Worked-example solution.

SELECT auth_code, (auth_code ~ '^[0-9]+$') AS is_numeric
FROM transactions
ORDER BY auth_code;

Rule of thumb: ^X$ means "exactly X"; X alone means "contains X somewhere." Pick deliberately.

Postgres ~ operator vs LIKE

The operator invariant: ~ is a Postgres regex match; LIKE is a glob-style wildcard match. LIKE cannot enforce "all digits" without exhaustive character-by-character checks; ~ does it in one expression.

  • col ~ 'pattern' — Postgres regex match (case-sensitive).
  • col ~* 'pattern' — case-insensitive variant.
  • col !~ 'pattern' — does not match.
  • col LIKE 'X%' — starts with X; cannot express "only digits."
  • Cross-dialect — Snowflake REGEXP_LIKE(col, pattern), MySQL col REGEXP pattern, BigQuery REGEXP_CONTAINS(col, pattern).

Worked example. Compare ~ and LIKE for "all digits."

| code | approach | works? |
|---|---|---|
| auth_code LIKE '[0-9]%' | wildcard | no — Postgres LIKE doesn't do character classes |
| auth_code SIMILAR TO '[0-9]+' | SQL-standard regex | yes (Postgres; SIMILAR TO is implicitly anchored) |
| auth_code ~ '^[0-9]+$' | Postgres regex | yes |

Worked-example solution.

SELECT
  auth_code,
  (auth_code ~ '^[0-9]+$')          AS via_tilde,
  (auth_code SIMILAR TO '[0-9]+')   AS via_similar_to
FROM transactions
ORDER BY auth_code;

Rule of thumb: default to ~ on Postgres for portability between simple regex and capture-group extraction; SIMILAR TO is rarely used in practice.

Counting matches: COUNT(*) WHERE auth_code ~ '^[0-9]+$'

The aggregation invariant: wrap the regex match in WHERE to filter rows; then COUNT(*) returns the number of qualifying rows. The same pattern counts any "rows whose column matches a regex" question.

  • COUNT(*) WHERE col ~ pattern — count of matching rows.
  • COUNT(DISTINCT col) WHERE col ~ pattern — unique matching values.
  • COUNT(*) FILTER (WHERE col ~ pattern) (Postgres) — same as above, expressed inside the SELECT.
  • SUM(CASE WHEN col ~ pattern THEN 1 ELSE 0 END) — engine-portable equivalent.

Worked example. Count numeric auth codes among 5 rows.

| transaction_id | auth_code |
|---|---|
| 1 | 12345 |
| 2 | A1234 |
| 3 | 99999 |
| 4 | 1234X |
| 5 | 00007 |

Numeric-only count = 3 (rows 1, 3, 5).

Worked-example solution.

SELECT COUNT(*) AS numeric_count
FROM transactions
WHERE auth_code ~ '^[0-9]+$';
-- numeric_count = 3

Rule of thumb: always state your regex in plain English first ("starts, one or more digits, ends") before writing it; interviewers reward explicit explanation.
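To prototype this query locally without a Postgres instance, SQLite's REGEXP operator can be wired to a Python function. A hedged sketch (SQLite has no built-in regex; the toy rows mirror the article's data):

```python
# Registering a function named `regexp` enables SQLite's REGEXP operator:
# X REGEXP Y calls regexp(Y, X), i.e. (pattern, value).
import re
import sqlite3

con = sqlite3.connect(":memory:")
con.create_function(
    "regexp", 2,
    lambda pattern, value: re.search(pattern, value or "") is not None)

con.execute("CREATE TABLE transactions (transaction_id INT, auth_code TEXT)")
con.executemany("INSERT INTO transactions VALUES (?, ?)",
                [(1, "12345"), (2, "A1234"), (3, "99999"),
                 (4, "1234X"), (5, "00007"), (6, "")])

n = con.execute(
    "SELECT COUNT(*) FROM transactions WHERE auth_code REGEXP '^[0-9]+$'"
).fetchone()[0]
print(n)  # 3 — rows 1, 3, 5 survive the anchored pattern
```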

Common beginner mistakes

  • Forgetting the ^ and $ anchors — matches strings with digits anywhere, not "only digits."
  • Using [0-9]* (zero or more) and matching the empty string when the prompt forbids it.
  • LIKE '%[0-9]%' — Postgres LIKE doesn't do character classes; use ~.
  • Mixing engine syntaxes — shorthand classes like \d vary across engines and regex flavors; stick to [0-9] for maximum portability.
  • Missing case-sensitivity — [A-Z] and [a-z] are different; use [A-Za-z] or ~* for combined.

SQL Interview Question on Numeric Authorization Code Transactions

Table transactions(transaction_id INT, auth_code TEXT, amount NUMERIC). Return the count of transactions whose auth_code is numeric only (digits 0–9, at least one digit, nothing else). Column: numeric_count.

Solution Using ~ '^[0-9]+$' regex with COUNT(*)

SELECT COUNT(*) AS numeric_count
FROM transactions
WHERE auth_code ~ '^[0-9]+$';

Step-by-step trace (input: 6 rows):

| transaction_id | auth_code |
|---|---|
| 1 | 12345 |
| 2 | A1234 |
| 3 | 99999 |
| 4 | 1234X |
| 5 | 00007 |
| 6 | (empty string) |

  1. Scan transactions — read all 6 rows.
  2. Apply regex match for each row:
    • '12345' — matches ^[0-9]+$ → keep.
    • 'A1234' — fails (leading A) → drop.
    • '99999' — matches → keep.
    • '1234X' — fails (trailing X) → drop.
    • '00007' — matches → keep.
    • '' — fails (+ requires at least one digit) → drop.
  3. COUNT(*) — 3 rows survive.

Output:

| numeric_count |
|---|
| 3 |

Why this works — concept by concept:

  • ^ start anchor — forces the match to begin at the first character; without it, 'A1234' would match the substring 1234 and incorrectly pass.
  • $ end anchor — forces the match to end at the last character; without it, '1234X' would match the prefix 1234 and incorrectly pass.
  • [0-9]+ character class with quantifier — exactly one or more digit characters; + excludes the empty string.
  • Postgres ~ operator — applies the regex per row; returns boolean usable in WHERE.
  • COUNT(*) — counts the surviving rows after the filter.
  • Cost — one scan of transactions; regex is O(L) per row where L is the auth code length → O(N · L) total time, O(1) space.



4. Python String Manipulation for the Largest Odd-Number Substring

Right-to-left digit-string scans for the longest odd-prefix in Python for data engineering

"Given a digit string, return the largest odd-number substring it contains" is the canonical right-to-left scan interview prompt. The mental model: the largest odd-number substring is the longest prefix of s that ends with an odd digit; scan from the right, find the first odd digit, return s[:i+1]. The same shape powers any "trim trailing X" question — trim trailing zeros, trim trailing whitespace, strip suffix.

Pro tip: "Largest odd number" means the biggest numeric value that is odd. The biggest substring of a digit string ending in an odd digit is the prefix up to and including that odd digit; longer prefixes ending in an even digit are bigger numerically but not odd. The right-to-left scan finds exactly the right cut point in one pass.

Odd-digit predicate: s[i] in '13579'

The parity invariant: a digit is odd iff it's one of 1, 3, 5, 7, 9. The Python idiom s[i] in '13579' tests parity in one expression with no integer parsing.

  • s[i] in '13579' — string-membership test; O(1) against a 5-char string.
  • int(s[i]) % 2 == 1 — parses then mods; works but slower.
  • ord(s[i]) % 2 == 1 — exploits ASCII (digits 0–9 are 48–57); fastest but unidiomatic.
  • s[i] in {'1','3','5','7','9'} — set membership; O(1); arguably cleanest.

Worked example. Test each character of '52468135'.

| i | s[i] | odd? |
|---|---|---|
| 0 | 5 | yes |
| 1 | 2 | no |
| 2 | 4 | no |
| 3 | 6 | no |
| 4 | 8 | no |
| 5 | 1 | yes |
| 6 | 3 | yes |
| 7 | 5 | yes |

Worked-example solution.

s = '52468135'
for i, ch in enumerate(s):
    print(i, ch, ch in '13579')
# 0 5 True
# 1 2 False
# 2 4 False
# 3 6 False
# 4 8 False
# 5 1 True
# 6 3 True
# 7 5 True

Rule of thumb: string-membership is the cleanest parity test for digit characters; reach for it before int().

Right-to-left scan: walk from end, return s[:i+1] on first odd

The scan invariant: start i = len(s) - 1, walk left while s[i] is even; the first odd s[i] is the cut point; return s[:i+1]. Right-to-left because the largest odd substring is the longest prefix ending in an odd digit — and the longest such prefix has the rightmost odd digit as its last character.

  • for i in range(len(s) - 1, -1, -1) — explicit reverse range.
  • if s[i] in '13579': return s[:i+1] — first odd from the right.
  • return '' — fallback when no odd digit exists.
  • Why s[:i+1] and not s[:i] — Python slicing is half-open; s[:i+1] includes index i.

Worked example. Apply the scan to '52468135'.

| i (right-to-left) | s[i] | odd? | action |
|---|---|---|---|
| 7 | 5 | yes | return s[:8] = '52468135' |

The scan finds the odd digit at i=7 immediately; result is the entire string.

Worked-example solution.

def largest_odd(s):
    for i in range(len(s) - 1, -1, -1):
        if s[i] in '13579':
            return s[:i+1]
    return ''

print(largest_odd('52468135'))   # '52468135'
print(largest_odd('5246'))       # '5'   — only first digit is odd
print(largest_odd('2468'))       # ''    — no odd digit

Rule of thumb: "longest prefix ending in X" → right-to-left scan, return s[:i+1] on first match.

Empty-string fallback when no odd digit exists

The fallback invariant: if the loop completes without returning, the input has no odd digits; the answer is the empty string. Be explicit; don't rely on the loop's implicit None return.

  • return '' — explicit empty-string answer.
  • return None — alternative if the prompt says "return None when no odd digit."
  • Document the fallback — comment on the choice; interviewers grade clarity.
  • Edge case: empty input — largest_odd('') should also return ''; the loop doesn't execute and falls through to return ''.

Worked example. Four small inputs, three of them edge cases.

| input | result |
|---|---|
| '' | '' |
| '2468' | '' |
| '4' | '' |
| '5' | '5' |

Worked-example solution.

def largest_odd(s):
    for i in range(len(s) - 1, -1, -1):
        if s[i] in '13579':
            return s[:i+1]
    return ''

assert largest_odd('') == ''
assert largest_odd('2468') == ''
assert largest_odd('4') == ''
assert largest_odd('5') == '5'

Rule of thumb: always include an assert for the empty-input case in your worked example; interviewers grade defensive thinking.

Common beginner mistakes

  • Scanning left-to-right for the first odd digit — finds the leftmost odd, not the rightmost; the longest substring is the rightmost-odd prefix.
  • Using s[:i] instead of s[:i+1] — drops the odd digit itself.
  • Calling int(s[i]) % 2 — parses each character; slower than membership test and unidiomatic.
  • Forgetting the empty-string fallback — function returns None implicitly on no-odd-digit input.
  • Mutating s mid-loop or building the result with += — use slicing for O(K) result construction in one step.
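A good defense against all of these mistakes is a brute-force oracle: enumerate every odd-ending substring, take the maximum by numeric value, and cross-check the scan against it. A sketch (the comparison is on int values because leading zeros make string equality too strict):

```python
def largest_odd(s):
    # Right-to-left scan from the article: longest prefix ending in an odd digit.
    for i in range(len(s) - 1, -1, -1):
        if s[i] in '13579':
            return s[:i + 1]
    return ''

def brute_force(s):
    # Oracle: every substring ending in an odd digit; max by numeric value.
    odds = [s[i:j] for i in range(len(s))
            for j in range(i + 1, len(s) + 1)
            if s[j - 1] in '13579']
    return max(odds, key=int) if odds else ''

for s in ['4206538', '52468135', '2468', '', '5', '105']:
    fast, slow = largest_odd(s), brute_force(s)
    assert (fast == slow == '') or int(fast) == int(slow)
```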

Python Interview Question on the Largest Odd-Number Substring

Given a string s of digit characters, return the largest odd-number substring that can be read from s by trimming trailing characters. If no odd-number substring exists, return the empty string.

def largest_odd_number(s):
    ...

Solution Using a right-to-left scan with s[i] in '13579'

def largest_odd_number(s):
    for i in range(len(s) - 1, -1, -1):
        if s[i] in '13579':
            return s[:i + 1]
    return ''

Step-by-step trace (input: s = '4206538'):

| index | s[i] | odd? | action |
|---|---|---|---|
| 6 | 8 | no | continue |
| 5 | 3 | yes | return s[:6] = '420653' |

  1. Initialize — i = 6 (last index of the length-7 string).
  2. s[6] = '8' — '8' in '13579'? No. Continue; i = 5.
  3. s[5] = '3' — '3' in '13579'? Yes. Return s[:6] = '420653'.

Output:

| return value |
|---|
| '420653' |

Why this works — concept by concept:

  • Right-to-left scan — the largest numeric substring of s ending in an odd digit is the longest prefix that ends with the rightmost odd digit; scanning from the right finds that cut point in one pass.
  • s[i] in '13579' — string-membership is O(1) against a 5-character constant; the cleanest parity test for digit characters.
  • s[:i+1] slicing — Python slicing is half-open, so s[:i+1] includes the odd digit itself (the last character of the result).
  • Early return — once the first odd digit from the right is found, no further scanning is needed.
  • Empty-string fallback — when no odd digit exists, the loop exits naturally and return '' handles the case explicitly without relying on implicit None.
  • No mutation, no allocation — slicing is O(K) for the K-length result; no intermediate buffers.
  • CostO(N) worst-case time when no odd digit exists; O(1) extra space; O(K) for the result string.



5. Python Hash Tables and Sorting for Country-Count Rollups

Counter rollups with deterministic tuple sort keys in Python for data engineering

"Count occurrences per country and return ranked output" is the canonical rollup interview prompt. The mental model: Counter(...) produces a frequency dict in one line; sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])) produces the ranked output sorted by descending count then ascending name on ties. The same shape powers per-key totals, frequency rankings, top-N reports.

Pro tip: Counter is purpose-built for "count occurrences" — don't reinvent it with a defaultdict(int) loop. The tuple sort key (-kv[1], kv[0]) is the canonical "value desc, key asc" pattern; memorize it because it appears in every rollup interview.
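The equivalence is worth seeing once: Counter and the hand-rolled defaultdict(int) loop produce identical counts, and Counter compares equal to the dict. A sketch with a toy country list:

```python
# Counter vs defaultdict(int): same counts, one line vs three.
from collections import Counter, defaultdict

countries = ['US', 'CA', 'US', 'MX', 'CA', 'US']

counts = Counter(countries)

manual = defaultdict(int)
for c in countries:
    manual[c] += 1

assert counts == manual  # Counter is a dict subclass; mapping equality holds

ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)  # [('US', 3), ('CA', 2), ('MX', 1)]
```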

collections.Counter for one-line frequency rollup

The Counter invariant: Counter(iterable) produces a dict-like object mapping each unique element to its count. The construction is one line; reads are O(1); the result is iterable as (key, count) pairs via .items() or .most_common().

  • Counter(['a', 'b', 'a']) — produces Counter({'a': 2, 'b': 1}).
  • counter.items() — (key, count) pairs.
  • counter.most_common(n) — top-N by count, in one call.
  • counter[k] — count for k; returns 0 for missing keys (not KeyError).

Worked example. Six country occurrences across three countries.

countries (input)
['US', 'CA', 'US', 'MX', 'CA', 'US']

After Counter(countries):

country count
US 3
CA 2
MX 1

Worked-example solution.

from collections import Counter

countries = ['US', 'CA', 'US', 'MX', 'CA', 'US']
counts = Counter(countries)
print(counts)
# Counter({'US': 3, 'CA': 2, 'MX': 1})

Rule of thumb: if the input is a flat list of keys to count, Counter is the right primitive — one line, no boilerplate.

Tuple sort key for "value desc, key asc" tie-break

The sort-key invariant: sorted(items, key=lambda kv: (-kv[1], kv[0])) sorts primarily by descending count (negate to flip the natural ascending order) and secondarily by ascending key on ties. The tuple breaks ties deterministically.

  • key=lambda kv: (-kv[1], kv[0]) — value desc, key asc (the default expected tie-break).
  • key=lambda kv: kv[1], reverse=True — same primary order; ties broken arbitrarily.
  • key=lambda kv: (-kv[1], -kv[0]) — value desc, key desc.
  • heapq.nlargest(n, items, key=lambda kv: kv[1]) — top-N by value when n << len(items).

Worked example. Two countries tied at 2; tie-break by name ascending (note the input lists MX before CA, but the sorted output puts CA first).

country count
US 3
MX 2
CA 2

After sorted(..., key=lambda kv: (-kv[1], kv[0])):

country count
US 3
CA 2
MX 2

Worked-example solution.

from collections import Counter

counts = Counter({'US': 3, 'MX': 2, 'CA': 2})
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
# [('US', 3), ('CA', 2), ('MX', 2)]

Rule of thumb: always pair a numeric desc sort with a name asc tie-break — interviewers reward the explicit secondary key, even when ties are unlikely.

Multi-field sort and top-N slicing

The composition invariant: after sorting, slice with [:N] for top-N; or use Counter.most_common(N) to do both in one call. The most_common method is the cleanest form when you don't need a custom tie-break.

  • sorted(...)[:N] — top-N after a custom sort.
  • Counter.most_common(N) — top-N built in (default tie-break is insertion order, not ascending key).
  • Multi-field input — Counter accepts tuples; Counter([(country, region) for ...]) rolls up per (country, region).
  • Convert to dict — dict(Counter(...)) for JSON serialization.

Worked example. Top-2 countries.

ranked: [('US', 3), ('CA', 2), ('MX', 2)]
top-2: [('US', 3), ('CA', 2)]

Worked-example solution.

from collections import Counter

counts = Counter({'US': 3, 'CA': 2, 'MX': 2})

# Custom tie-break + slice
top2_custom = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]
# [('US', 3), ('CA', 2)]

# Built-in most_common (default insertion-order tie-break)
top2_builtin = counts.most_common(2)
# [('US', 3), ('CA', 2)]

Rule of thumb: use most_common(N) when the prompt doesn't specify a tie-break; switch to a custom sorted(...) with a tuple key when it does.

Common beginner mistakes

  • Building a defaultdict(int) loop when Counter does it in one line.
  • Forgetting the tie-break and getting non-deterministic output across runs.
  • Using key=lambda kv: -kv[1] on string values — TypeError on negation; use reverse=True for strings.
  • Converting Counter to list of tuples manually — use .items().
  • Sorting then taking the wrong end with [-N:] — after a descending sort, [:N] is the top-N; [-N:] is the bottom-N.
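The string-negation pitfall from the list above is worth seeing once (toy data — the tuple values here are strings, not counts):

```python
pairs = [('US', 'c'), ('CA', 'a'), ('MX', 'b')]

# sorted(pairs, key=lambda kv: -kv[1])  # TypeError: unary minus on a str
ranked = sorted(pairs, key=lambda kv: kv[1], reverse=True)  # works for strings
print(ranked)
# [('US', 'c'), ('MX', 'b'), ('CA', 'a')]
```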

Python Interview Question on Country-Count Rollups

You are given a list of country codes (strings). Return a list of (country, count) tuples sorted by count descending, then by country ascending on ties.

def count_countries(countries):
    ...

Solution Using Counter and a tuple sort key

from collections import Counter

def count_countries(countries):
    counts = Counter(countries)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

Step-by-step trace (input: countries = ['US', 'CA', 'US', 'MX', 'CA', 'US', 'BR', 'CA']):

  1. Counter(countries) — single pass through the list, increment per element.
    • After: Counter({'US': 3, 'CA': 3, 'MX': 1, 'BR': 1}) (repr orders by count, insertion order on ties).
  2. counts.items() — pairs: [('US', 3), ('CA', 3), ('MX', 1), ('BR', 1)] (insertion order; Python preserves it from 3.7+).
  3. sorted(..., key=lambda kv: (-kv[1], kv[0]))
    • Primary key: -count → US (-3), CA (-3), MX (-1), BR (-1). After sort by primary: ties at -3 and ties at -1.
    • Secondary key: country kv[0] ascending — at -3: CA before US; at -1: BR before MX.
    • Final order: [('CA', 3), ('US', 3), ('BR', 1), ('MX', 1)].

Output:

country count
CA 3
US 3
BR 1
MX 1

Why this works — concept by concept:

  • Counter(iterable) — one-line frequency rollup; iterates the input once and maintains a dict[str, int] of counts.
  • .items() — exposes (key, count) pairs as the input to sorted; preserves insertion order, but we'll re-sort.
  • -kv[1] primary key — negating the count flips ascending sort into descending; standard Python idiom for "biggest first."
  • kv[0] secondary key — country name ascending breaks ties deterministically.
  • sorted(...) — returns a new list of tuples; doesn't mutate the input.
  • No reverse=True — using a tuple with negated primary lets ascending and descending live together cleanly; reverse=True would flip both keys, which is wrong.
  • Cost — Counter is O(N); sorted is O(K log K) where K is the number of distinct countries → O(N + K log K) total.

PYTHON
Topic — hash table
Hash-table problems

Practice →

PYTHON
Topic — sorting
Sorting problems

Practice →


Tips to Crack Intuit Data Engineering Interviews

The loop is mixed SQL + Python — drill both

The curated Intuit practice set is 3 SQL + 2 Python. Skipping either language to focus on the other is a misallocation; you can fail the loop on a single Python or single SQL round. The Intuit SQL practice page and the Intuit Python practice page are the right starting surfaces.

Window-function-over-aggregate is the medium-tier ceiling

#561 Rank Transaction Types by Average Expense is the only MEDIUM in the set, and the trick is recognizing that the window function runs after the GROUP BY aggregate. Without that mental model, candidates try to wrap a CTE around the aggregate (still works, just verbose) or — worse — try to apply RANK() to raw transaction rows. Drill the window-functions practice page until "RANK over AVG" is one fluid query.

Regex shows up twice — once in SQL, once in Python

#563 Numeric Authorization Code Transactions uses Postgres ~ '^[0-9]+$'; #564 Largest Odd Number in String can be solved with Python re.findall(r'[13579]+', s) (though the right-to-left scan covered above is cleaner). Memorize the regex flavors per engine: Postgres ~, Snowflake REGEXP_LIKE, MySQL REGEXP, BigQuery REGEXP_CONTAINS. Drill the regex practice page for both languages.
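For #564 specifically, one regex formulation (a variant of the findall idea, sketched here rather than the canonical solution) leans on greedy backtracking: `[0-9]*` consumes the whole string, then gives characters back until `[13579]` can match the rightmost odd digit.

```python
import re

def largest_odd_number(s):
    # Greedy [0-9]* backtracks until [13579] matches the rightmost
    # odd digit, so the match is exactly the answer prefix.
    m = re.match(r'[0-9]*[13579]', s)
    return m.group() if m else ''
```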

Easy-tier discipline matters

Four of five Intuit problems are EASY-tier. Easy at Intuit means the interviewer expects clean, idiomatic code with no off-by-ones, every aggregate paired with the right GROUP BY or Counter, every regex anchored. Stuttering through an EASY-tier problem signals worse than failing a MEDIUM-tier one. The Intuit easy practice page is the right warmup.

Where to practice on PipeCode

Start with the Intuit practice page for the curated 5-problem set. Then drill the matching topic pages: window-functions, joins, subquery, regex, string-manipulation, hash-table, sorting. The interview courses page bundles structured SQL and Python curricula. For broad coverage, browse by topic.

Communication and approach under time pressure

Talk through the structure before writing code: "I'll group by transaction_type, compute AVG, then RANK over the average." Interviewers grade process as much as the final answer. Leave 5 minutes at the end of each problem for an edge-case sweep: empty input, ties, NULL columns, no odd digits, regex against empty strings, single-element partitions. The most common "almost passed" failure mode is correct happy-path code that crashes on empty input — a 30-second sweep prevents it.
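Concretely, the 30-second edge-case sweep on the Counter rollup from section 5 is a handful of assertions (reusing the count_countries solution from that section):

```python
from collections import Counter

def count_countries(countries):
    counts = Counter(countries)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Sweep: empty input, all-tied counts, single element
assert count_countries([]) == []
assert count_countries(['MX', 'CA']) == [('CA', 1), ('MX', 1)]
assert count_countries(['US']) == [('US', 1)]
print('edge cases pass')
```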


Frequently Asked Questions

What is the Intuit data engineering interview process like?

Intuit's data engineering interview opens with a recruiter screen, then a technical phone screen with one SQL or Python coding problem, then an onsite (or virtual onsite) of four to five rounds: two coding rounds split between SQL (window functions over aggregates, JOIN + subquery, regex) and Python (string manipulation, Counter rollups), one data-modeling or system-design discussion (transactions, expenses, finance dimensional models), and one to two behavioral rounds. The coding rounds are heavy on the patterns covered in this guide.

What SQL topics does Intuit test for data engineers?

Intuit SQL interviews concentrate on three primitives that map directly to the curated set: window functions over aggregates (#561 — RANK() OVER (ORDER BY AVG(amount) DESC)), JOINs and subqueries with HAVING COUNT(*) > 1 for duplicate-detection (#562), and Postgres-flavored regex ~ '^[0-9]+$' for pattern matching (#563). Memorize these three patterns and you will recognize most Intuit SQL prompts within the first 60 seconds.

What Python topics does Intuit test for data engineers?

Intuit Python interviews concentrate on two primitives: string-manipulation right-to-left scans (#564 — find the rightmost odd digit; return the prefix) and collections.Counter rollups with tuple sort keys (#565 — count occurrences, sort by count desc / key asc). Both problems are EASY-tier; interviewers expect clean idiomatic Python with no off-by-ones.

How hard are Intuit data engineering interview questions?

The Intuit set is 4 easy + 1 medium + 0 hard. The MEDIUM is #561 (window function over aggregate) — the only problem that requires layering two SQL constructs. The four EASY problems test fundamentals: subquery + IN, regex, string scan, Counter. Stuttering on any EASY problem signals lack of fluency; correctness alone isn't enough.

Are window functions and regex common in Intuit interviews?

Yes. #561 explicitly tests RANK() over a GROUP BY aggregate. Regex appears in #563 (SQL ~ '^[0-9]+$') and as a viable alternative for #564 (Python re.findall(r'[13579]+', s)). Drill the window-functions practice page and the regex practice page.

How many Intuit practice problems should I solve before the interview?

Solve all 5 problems on the Intuit practice page end-to-end — untimed first, then timed at 25 minutes per problem. After that, broaden to 30 to 40 additional problems spread across the matching topic pages: window-functions, joins, subquery, regex, string-manipulation, hash-table, sorting. The Intuit SQL practice page, Intuit Python practice page, and Intuit easy practice page are the right surfaces for the final week of prep.


Start practicing Intuit data engineering problems
