Walmart data engineering interview questions sit at the intersection of SQL and Python, with framings that read like real retail data: SKU codes that need parsing, campaign impressions rolled up by time bucket, stocked-vs-unstocked department joins, badge-time security alerts, access logs that need entry/exit balance checks, and BFS shortest-path queries over distribution-center grids. Expect three SQL problems (string parsing, time-series rollup, anti-join) and six Python problems (math primitives, hash-table-driven security analytics, BFS pathfinding, DP-with-BFS hybrids).
This guide walks through the eight topic clusters Walmart actually tests, each with a detailed topic explanation, per-sub-topic walkthroughs with worked examples and solutions, and an interview-style problem whose full solution explains why it works. The mix matches the curated 9-problem Walmart set (2 easy, 7 medium, 0 hard)—medium-heavy without the algorithmic Hard tier of Databricks' set—so the article focuses on solid mid-difficulty fluency rather than exotic techniques.
Top Walmart data engineering interview topics
From the Walmart data engineering practice set, the eight numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–8) | Why it shows up at Walmart |
|---|---|---|
| 1 | SQL string manipulation and pattern matching | Extract product category from SKU code—regex extraction on a fixed-format string. |
| 2 | SQL aggregation, joins, and date functions | Campaign impression time rollup—DATE_TRUNC + GROUP BY + join to a campaign dimension. |
| 3 | SQL anti-joins for missing-relationship queries | Unstocked store departments—rows in departments with no matching row in stock. |
| 4 | Python math and digit-by-digit primitives | Sum of digits—integer arithmetic primitives every interviewer expects fluency in. |
| 5 | Python hash tables for time-window security alerts | Badge time security alert—group events per badge, find rapid-succession swipes. |
| 6 | Python hash tables for access log mismatch detection | Access log mismatch detection—pair entry events to exit events, flag orphans. |
| 7 | Python BFS on grid / matrix for shortest paths | Minimum path to treasure—4-neighbor BFS on a 2D grid. |
| 8 | Python dynamic programming with BFS for treasure-finding | Find a treasure—DP-on-grid hybridized with BFS for multi-target optimization. |
Retail framing rule: Walmart prompts dress general algorithm patterns in retail data. SKU code = string with a structured prefix; campaign impressions = time-series events; stocked departments = join from a dimension to a fact; badge swipes = timestamped events grouped by entity; access logs = paired events that must balance; treasure path = BFS on a grid. Mapping the framing to the algorithm is half the win.
1. SQL String Manipulation and Pattern Matching for Retail Codes
SQL string parsing for retail product codes in data engineering
Retail systems are awash in structured strings: SKUs like SKU-CAT123-456, store codes like STORE-NW-042, promo IDs like PROMO-Q4-XMAS-2026. The interview shape is always the same: extract one field from the structured string and group / aggregate by it. Pick the lightest tool that still answers: position-based (SUBSTRING, LEFT, SPLIT_PART) for fixed formats, LIKE for wildcard filters, regex only when the format is irregular.
Pro tip: Leading-wildcard `LIKE '%CAT%'` and per-row regex defeat B-tree indexes. For high-volume tables, anchor patterns at the start (`LIKE 'SKU-CAT%'`) or generate an indexed extracted column.
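One way to materialize that extracted column is a generated column plus an index—a minimal PostgreSQL sketch (assumes PostgreSQL 12+ generated columns; the column and index names are illustrative):

```sql
-- Hypothetical names; assumes PostgreSQL 12+ generated columns.
ALTER TABLE products
  ADD COLUMN category_code TEXT
  GENERATED ALWAYS AS (SPLIT_PART(sku, '-', 2)) STORED;

-- Equality and prefix filters on category_code can now use this index.
CREATE INDEX idx_products_category_code ON products (category_code);
```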
SUBSTRING, LEFT, RIGHT, POSITION
Four functions cover position-based extraction. Position-based is fast but brittle: change the SKU format and the query silently extracts the wrong field.
- `LEFT(s, n)` — first `n` characters; `RIGHT(s, n)` — last `n`.
- `SUBSTRING(s FROM start FOR length)` — general 1-indexed slice.
- `POSITION(needle IN haystack)` — 1-indexed offset of the first occurrence (0 if absent).
- `SPLIT_PART(s, delim, n)` (PostgreSQL) — the n-th delimiter-separated field; the cleanest shortcut for SKU-style strings.
- Worked example: extract the category from `SKU-CAT123-456`.
| function | call | result |
|---|---|---|
| `LEFT` | `LEFT('SKU-CAT123-456', 3)` | `'SKU'` |
| `SUBSTRING` | `SUBSTRING('SKU-CAT123-456' FROM 5 FOR 6)` | `'CAT123'` |
| `SPLIT_PART` | `SPLIT_PART('SKU-CAT123-456', '-', 2)` | `'CAT123'` |
```sql
SELECT
  sku,
  SPLIT_PART(sku, '-', 2) AS category_code
FROM products;
```
Pattern matching with LIKE and SIMILAR TO
LIKE uses two wildcards (% for any-length, _ for one char) and is case-sensitive in PostgreSQL (ILIKE for case-insensitive). SIMILAR TO is a SQL-standard middle ground with regex-style alternation and quantifiers.
- Index-friendly: `LIKE 'SKU-CAT%'` (leading anchor) — uses a B-tree.
- Index-defeating: `LIKE '%CAT%'` (leading wildcard) — sequential scan.
- `SIMILAR TO 'SKU-(CAT|DOG)%'` — alternation across categories (`SIMILAR TO` must match the whole string, hence the trailing `%`).
- Worked example: `WHERE sku LIKE 'SKU-CAT%'` matches every CAT-category SKU using the index on `sku`.
```sql
SELECT sku, name
FROM products
WHERE sku LIKE 'SKU-CAT%';
```
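For the alternation case, a hedged sketch of the equivalent `SIMILAR TO` filter on the same `products` table (remember `SIMILAR TO` anchors to the whole string):

```sql
-- Matches SKUs whose category is CAT or DOG; the trailing %
-- is required because SIMILAR TO must consume the entire string.
SELECT sku, name
FROM products
WHERE sku SIMILAR TO 'SKU-(CAT|DOG)%';
```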
Regex extraction with REGEXP_REPLACE / REGEXP_MATCHES
When the format is irregular (optional prefixes, variable-length fields, mixed delimiters), regex is the right tool—at the cost of speed. Use it when the format demands it, not as a default.
- `REGEXP_REPLACE(s, pattern, replacement, flags)` — substitute matches.
- `REGEXP_MATCHES(s, pattern, flags)` — returns a `text[]` array of capture groups.
- `SUBSTRING(s FROM pattern)` — first match of the pattern (or its first capture group if the pattern uses parentheses).
- Worked example: `SUBSTRING('SKU-CAT123-NEW' FROM '^SKU-([A-Z]+)\d+')` returns `'CAT'`; the same pattern on `'SKU-DOG456'` returns `'DOG'`.
```sql
SELECT
  sku,
  SUBSTRING(sku FROM '^SKU-([A-Z]+)\d+') AS category_code,
  COUNT(*) AS sku_count
FROM products
GROUP BY category_code;
```
Common beginner mistakes
- Using `LIKE '%CAT%'` and getting a sequential scan on a billion-row table — missing the index opportunity.
- Forgetting `LIKE` is case-sensitive in PostgreSQL — use `ILIKE` for case-insensitive matching.
- Writing `SPLIT_PART(s, '-', 0)` thinking it gives the first field — `SPLIT_PART` is 1-indexed, so the first field is index 1.
- Hardcoding fixed-position substrings (`SUBSTRING(sku FROM 5 FOR 3)`) and breaking when the SKU format changes.
- Using `REGEXP_MATCHES` and forgetting it returns a `text[]` array — you may need to dereference with `[1]`.
SQL interview question on string parsing and pattern matching
Table products(sku TEXT, name TEXT) contains SKUs in the format SKU-{CATEGORY}-{NUMERIC_ID} where CATEGORY is uppercase letters and NUMERIC_ID is digits. Return per-category counts of products, sorted by count descending then category alphabetically.
Solution using SPLIT_PART and GROUP BY
```sql
SELECT
  SPLIT_PART(sku, '-', 2) AS category,
  COUNT(*) AS product_count
FROM products
WHERE sku LIKE 'SKU-%-%'
GROUP BY category
ORDER BY product_count DESC, category ASC;
```
Why this works: `SPLIT_PART(sku, '-', 2)` returns the 2nd dash-separated field—exactly the category portion of the SKU. The `WHERE sku LIKE 'SKU-%-%'` predicate filters to well-formed rows with at least two dashes (defending against malformed data without regex). `GROUP BY category` collapses to one row per category; `COUNT(*)` is the per-category multiplicity. `ORDER BY product_count DESC, category ASC` gives deterministic output that matches typical "top categories" tie-break rules.
2. SQL Aggregation, Joins, and Date Functions for Time-Series Rollups
SQL time-series rollups in data engineering
Time-series rollups are the most common SQL pattern in retail data engineering: per-campaign hourly impressions, per-store daily revenue. The pattern is DATE_TRUNC(unit, ts) to bucket the timestamp, GROUP BY the truncated value plus other dimensions, aggregate the metric. Walmart's rollup question layers a join to a dimension on top—classic star-schema query.
Pro tip: State the timezone convention before writing `WHERE`. `DATE_TRUNC` operates in the session timezone; UTC vs `America/Chicago` shifts day boundaries silently.
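A hedged sketch of pinning the boundary explicitly (assumes `ts` is `TIMESTAMPTZ`; `AT TIME ZONE` converts it to a local wall-clock timestamp before truncation):

```sql
-- Bucket by Chicago-local days rather than session-timezone days.
SELECT
  DATE_TRUNC('day', ts AT TIME ZONE 'America/Chicago') AS chicago_day,
  COUNT(*) AS events
FROM events
GROUP BY chicago_day
ORDER BY chicago_day;
```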
DATE_TRUNC to bucket by hour / day / week
DATE_TRUNC(unit, ts) truncates a timestamp to the start of the named unit ('hour', 'day', 'week', 'month', …). The result is still a timestamp, just zeroed below the unit boundary—every row in the bucket maps to the same left-edge timestamp.
- Week convention — `DATE_TRUNC('week', ts)` returns Monday (ISO). For Sunday-start (US retail), compute the offset manually (see the sketch after the hourly example below).
- Index trap — `DATE_TRUNC(...)` in `WHERE` defeats the timestamp index. Prefer half-open ranges (`ts >= '2026-04-28' AND ts < '2026-04-29'`); `DATE_TRUNC` in `GROUP BY` is fine.
- Worked example: truncate to hour.
| input | `DATE_TRUNC('hour', ts)` |
|---|---|
| `'2026-04-28 09:15:30'` | `'2026-04-28 09:00:00'` |
| `'2026-04-28 09:42:00'` | `'2026-04-28 09:00:00'` |
| `'2026-04-28 10:03:15'` | `'2026-04-28 10:00:00'` |
```sql
SELECT
  DATE_TRUNC('hour', ts) AS hour,
  COUNT(*) AS events
FROM events
GROUP BY hour
ORDER BY hour;
```
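For the Sunday-start convention mentioned above, a common trick (hedged sketch, same `events` table) shifts the timestamp forward a day, truncates to the ISO Monday, then shifts back:

```sql
-- Sunday-start weeks: shift +1 day, truncate to Monday, shift -1 day.
SELECT
  DATE_TRUNC('week', ts + INTERVAL '1 day') - INTERVAL '1 day' AS week_start_sunday,
  COUNT(*) AS events
FROM events
GROUP BY week_start_sunday
ORDER BY week_start_sunday;
```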
GROUP BY on a truncated date alongside other keys
Real rollups have multiple dimensions. GROUP BY accepts as many keys as you need; the result has one row per unique combination. PostgreSQL lets you reference the SELECT alias (GROUP BY hour, campaign_id)—cleaner than re-evaluating the DATE_TRUNC expression.
- Worked example: two campaigns over two hours.
| campaign | hour | impressions |
|---|---|---|
| C1 | 09:00 | 2 |
| C1 | 10:00 | 1 |
| C2 | 09:00 | 1 |
| C2 | 10:00 | 1 |
```sql
SELECT
  campaign_id,
  DATE_TRUNC('hour', ts) AS hour,
  COUNT(*) AS impressions
FROM impressions
GROUP BY campaign_id, hour
ORDER BY campaign_id, hour;
```
Joining a fact table to a campaign dimension
Real-world rollups need attributes from a dimension (campaign name, advertiser, region). Join fact-to-dimension before the GROUP BY; aggregating-then-joining usually produces the same plan but more code.
- `INNER JOIN` if every fact row must have a matching dimension (typical).
- `LEFT JOIN` if orphan facts should appear with NULL dimension attributes — state the assumption.
- Worked example: the impressions table joins to a tiny campaign dimension, then groups by `(campaign_id, hour)`.
```sql
SELECT
  c.campaign_id,
  c.name,
  DATE_TRUNC('hour', i.ts) AS hour,
  COUNT(*) AS impressions
FROM impressions i
JOIN campaigns c ON c.campaign_id = i.campaign_id
GROUP BY c.campaign_id, c.name, hour
ORDER BY c.campaign_id, hour;
```
Common beginner mistakes
- Filtering on `DATE_TRUNC(...)` in `WHERE` and missing the index — use half-open ranges for predicates.
- Forgetting timezone — rolling up timestamps in UTC while displaying them in local time shifts the boundaries silently.
- Using `BETWEEN '2026-03-01' AND '2026-03-31'` for "all of March" — on a timestamp column the upper bound means midnight, so everything after `2026-03-31 00:00:00` is missed.
- Aggregating before joining when joining-then-aggregating produces the same result with less code.
- Forgetting to include the dimension's descriptive columns (e.g. `name`) in `GROUP BY` — causes a "must appear in the GROUP BY clause" error.
Drill SQL aggregation patterns →
SQL interview question on time-series rollups
Tables impressions(impression_id, campaign_id, ts) and campaigns(campaign_id, name, type). Return per-campaign hourly impression counts for the day 2026-04-28, including the campaign name. Sort by campaign_id then hour.
Solution using DATE_TRUNC, JOIN, and GROUP BY
```sql
SELECT
  c.campaign_id,
  c.name,
  DATE_TRUNC('hour', i.ts) AS hour,
  COUNT(*) AS impressions
FROM impressions i
JOIN campaigns c ON c.campaign_id = i.campaign_id
WHERE i.ts >= '2026-04-28'
  AND i.ts < '2026-04-29'
GROUP BY c.campaign_id, c.name, hour
ORDER BY c.campaign_id, hour;
```
Why this works: The half-open WHERE predicate (>= '2026-04-28' AND < '2026-04-29') is index-friendly—the planner can use a B-tree index on ts and skip rows outside the day in O(log n). The JOIN attaches the campaign name; INNER JOIN is correct because every impression must have a campaign. DATE_TRUNC('hour', i.ts) buckets rows into hourly slots; GROUP BY collapses to one row per (campaign, hour). COUNT(*) is the impression metric. Total cost: index-range scan on impressions (fast), hash-join with campaigns (small dimension), hash-aggregate on the result.
3. SQL Anti-Joins for Missing-Relationship Queries
Anti-joins for missing-relationship queries in SQL for data engineering
The "rows in A with no matching row in B" pattern is the anti-join: products never purchased, customers without orders, machines without recent heartbeats, unstocked store departments. Three SQL formulations express the same set: LEFT JOIN ... WHERE rhs IS NULL, NOT EXISTS (correlated), and NOT IN (with the NULL trap). Default to NOT EXISTS for readability and NULL-safety.
Pro tip: The planner usually rewrites `NOT EXISTS` and `LEFT JOIN ... IS NULL` to the same hash anti-join (O(N + M)). `NOT IN` short-circuits to zero rows when the subquery contains a NULL—a silent correctness bug.
LEFT JOIN ... WHERE rhs IS NULL
Outer-join the left side to the right, then filter to rows where the right side is NULL—those are the unmatched left rows. Pick a non-nullable right-side column (typically the right table's primary key) for the IS NULL check; otherwise legitimate NULL right-side rows leak through.
- Worked example: `departments = [(s1, d1), (s1, d2), (s2, d1)]`; `stock = [(s1, d1), (s2, d1)]`. Unstocked = `(s1, d2)`.
```sql
SELECT d.store_id, d.dept_id
FROM departments d
LEFT JOIN stock s
  ON s.store_id = d.store_id AND s.dept_id = d.dept_id
WHERE s.store_id IS NULL;
```
NOT EXISTS (correlated)
NOT EXISTS (subquery) is true when the subquery yields zero rows for the outer row. It reads naturally, short-circuits on the first match, and is NULL-safe by construction. Multi-column join predicates compose cleanly inside the subquery's WHERE.
- Worked example: same `departments` / `stock`; the query below returns `(s1, d2)` only.
```sql
SELECT d.store_id, d.dept_id
FROM departments d
WHERE NOT EXISTS (
  SELECT 1
  FROM stock s
  WHERE s.store_id = d.store_id AND s.dept_id = d.dept_id
);
```
NULL trap of NOT IN
SQL's three-valued logic means `x NOT IN (a, b, NULL)` can never evaluate to TRUE: for any `x` not in the list, `x != NULL` is UNKNOWN, so the whole predicate is UNKNOWN. The `WHERE` clause then drops every row—silent zero-row result. If you must use `NOT IN`, add `IS NOT NULL` inside the subquery; better, use `NOT EXISTS`.
- Worked example: the buggy and defensive shapes side by side.
```sql
-- Returns 0 rows if any product_id in purchases is NULL!
SELECT product_id FROM products
WHERE product_id NOT IN (
  SELECT product_id FROM purchases  -- contains NULLs
);

-- Defensive variant
SELECT product_id FROM products
WHERE product_id NOT IN (
  SELECT product_id FROM purchases
  WHERE product_id IS NOT NULL
);
```
Common beginner mistakes
- Reaching for `NOT IN` on a subquery that returns NULLs — silent zero-row result.
- Using `INNER JOIN` for "missing rows" — logically wrong; it returns the matched rows instead of the unmatched ones.
- Picking a nullable right-side column for the `IS NULL` filter — accidentally includes legitimate NULL right-side rows.
- Writing the join condition in `WHERE` instead of `ON` for a `LEFT JOIN` — turns the outer join into an inner join.
- Forgetting that anti-joins return rows from the left side only — right-side columns are always NULL after the filter and shouldn't be projected.
SQL interview question on anti-joins
Table `departments(store_id, dept_id)` lists every (store, department) pair that should exist; table `stock(store_id, dept_id, qty)` records current inventory. Return all `(store_id, dept_id)` pairs that have no inventory row at all (unstocked). Sort by `store_id` then `dept_id`.
Solution using NOT EXISTS
```sql
SELECT d.store_id, d.dept_id
FROM departments d
WHERE NOT EXISTS (
  SELECT 1 FROM stock s
  WHERE s.store_id = d.store_id AND s.dept_id = d.dept_id
)
ORDER BY d.store_id, d.dept_id;
```
Why this works: NOT EXISTS is the declarative, NULL-safe shape for "rows in departments with no matching row in stock." The correlated subquery is evaluated per outer row but short-circuits on the first matching stock row, so the per-row cost is O(1) average (with an index on (store_id, dept_id) in stock). Multi-column join predicates compose naturally inside the subquery's WHERE. Total cost: hash anti-join, O(N + M) where N is departments and M is stock. Output is just the left-side columns, sorted deterministically.
4. Python Math and Digit-by-Digit Primitives
Python math primitives in data engineering
Easy DE questions like "sum of digits" test whether you can write clean code under pressure—right idiom, correct edge cases, complexity awareness. Three idioms compete: modular arithmetic (% 10, // 10, O(1) memory), string conversion (sum(int(c) for c in str(n)), Pythonic), and recursive digital root (O(log* n) iterations, with a famous O(1) closed form). Pick by what the prompt asks—and mention the alternative.
Pro tip: State the negative-input contract. With Python's floor division, a `while n != 0` modular loop never reaches zero for `n < 0` (and a `while n > 0` guard silently returns 0) unless you `abs()` first; the string form raises on the leading `'-'`.
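A minimal demonstration of the floor-division behavior behind that warning:

```python
# Python floor division rounds toward negative infinity, so a
# `while n != 0` digit loop would cycle on -1 forever.
assert (-7 % 10, -7 // 10) == (3, -1)
assert (-1 % 10, -1 // 10) == (9, -1)  # -1 // 10 stays -1 on every iteration
```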
Integer digit extraction with % 10 and // 10
n % 10 peels the last digit; n // 10 drops it. Loop until n reaches zero. After iteration k, the running total holds the sum of the k least-significant digits. Loop runs O(log₁₀ n) times for non-negative n; abs-first to handle negatives.
- Worked example: trace `n = 1234`.
| iteration | n before | last digit (`n % 10`) | total | n after (`n // 10`) |
|---|---|---|---|---|
| 1 | 1234 | 4 | 4 | 123 |
| 2 | 123 | 3 | 7 | 12 |
| 3 | 12 | 2 | 9 | 1 |
| 4 | 1 | 1 | 10 | 0 |
```python
def digit_sum(n: int) -> int:
    n = abs(n)
    total = 0
    while n > 0:
        total += n % 10
        n //= 10
    return total
```
String-conversion alternative (sum(int(c) for c in str(n)))
The Pythonic one-liner: sum(int(c) for c in str(abs(n))). The abs(n) strips a leading minus sign that would otherwise trip int('-'). Memory is O(log n) for the intermediate string vs O(1) for the modular form—usually irrelevant.
- Worked example: `n = 1234` yields the digits `'1', '2', '3', '4'` → `1 + 2 + 3 + 4 = 10`.
```python
def digit_sum(n: int) -> int:
    return sum(int(c) for c in str(abs(n)))
```
Recursive digit sum (digital root)
The digital root is the iterative digit-sum applied until a single digit remains. Three implementations: recursive, iterative, and the O(1) closed form 1 + (n - 1) % 9 (for n > 0), which falls out of the fact that n mod 9 is preserved by the digit-sum operation.
- Recursive — clean; the depth grows so slowly (each level replaces n with its digit sum) that Python's recursion limit is never a practical concern.
- Iterative — `while n >= 10: n = digit_sum(n)`. Same complexity, no recursion.
- Closed form — `1 + (n - 1) % 9` for `n > 0`; special-case `n = 0`.
- Worked example: `digital_root(999)`: `999 → 27 → 9`. Closed-form check: `1 + 998 % 9 = 1 + 8 = 9`. ✓
```python
def digital_root_iter(n: int) -> int:
    n = abs(n)
    while n >= 10:
        n = sum(int(c) for c in str(n))
    return n

def digital_root_closed(n: int) -> int:
    n = abs(n)
    if n == 0:
        return 0
    return 1 + (n - 1) % 9
```
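A quick property check (assuming the two functions above are in scope) confirming the closed form agrees with the iterative version, including the `n = 0` edge case:

```python
# Exhaustive agreement check over a small range.
for n in range(100_000):
    assert digital_root_closed(n) == digital_root_iter(n)
```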
Common beginner mistakes
- Forgetting to `abs(n)` on negative inputs — the modular loop either returns 0 (`while n > 0`) or never terminates (`while n != 0`), and `int('-')` raises with the string approach.
- Using `n / 10` (true division, returns a float) instead of `n // 10` (floor division) — floats break the digit decomposition.
- Recursing without a base case for `n < 10` — infinite recursion.
- Hard-coding `digit_sum` as `n % 10 + n // 10` for two-digit numbers — works for 10–99, breaks for 100+.
- Assuming the closed-form digital-root identity `1 + (n - 1) % 9` holds for `n = 0` — it gives `9` instead of `0`. Special-case `n = 0`.
Python interview question on math primitives
Given a non-negative integer n, return the digital root (the iterative sum-of-digits until a single digit remains). Solve in O(1) time using the closed-form identity, and provide an iterative version as a fallback.
Solution using the closed-form identity with iterative fallback
```python
def digital_root(n: int) -> int:
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    return 1 + (n - 1) % 9

def digital_root_iter(n: int) -> int:
    if n < 0:
        raise ValueError("n must be non-negative")
    while n >= 10:
        n = sum(int(c) for c in str(n))
    return n
```
Why this works: the closed-form identity comes from the fact that a number and its digit sum are congruent modulo 9 (their difference is always divisible by 9), so digital_root(n) ≡ n (mod 9) for n > 0. The shift `1 + (n - 1) % 9` maps n to the range [1, 9] instead of [0, 8], matching the digital-root convention. The iterative fallback is provided for clarity and as a sanity check; both produce the same answer for n > 0. The n = 0 case is special-cased because the closed form would map 0 to 9, which is incorrect.
5. Python Hash Tables for Time-Window Security Alerts
Hash tables for time-window security analytics in Python for data engineering
Per-entity timestamped events with a time-window predicate—"more than K events in T seconds"—is a textbook DE pattern: rate limiting, fraud detection, badge security alerts. Algorithm: group by entity with defaultdict(list), sort by timestamp within each group, two-pointer sweep to maintain a sliding window. The shape generalizes; recognize it.
Pro tip: Use `<= window_seconds` (not `<`) when the prompt says "within T seconds" inclusive of the boundary. An off-by-one here flips whether boundary events flag.
Sort events by timestamp, scan with two pointers
When events arrive in arbitrary order, sort first within each group—the sliding window only makes sense over chronologically ordered events. Sort is O(E log E) total; the two-pointer sweep is O(E) (each pointer advances at most E times). Net O(E log E).
- Worked example: `B1` events `[09:00, 09:01:30, 09:02, 09:30]`, 5-minute window.
| right | event | window | count |
|---|---|---|---|
| 0 | 09:00 | `[09:00]` | 1 |
| 1 | 09:01:30 | `[09:00, 09:01:30]` | 2 |
| 2 | 09:02 | `[09:00, 09:01:30, 09:02]` | 3 |
| 3 | 09:30 | `[09:30]` | 1 |
```python
def has_alert(timestamps: list[float], window_sec: float, threshold: int) -> bool:
    timestamps.sort()
    left = 0
    for right in range(len(timestamps)):
        while timestamps[right] - timestamps[left] > window_sec:
            left += 1
        if right - left + 1 >= threshold:
            return True
    return False
```
Sliding-window count of accesses within T seconds
The window invariant: timestamps[right] - timestamps[left] <= window_seconds. On each new right, advance left until the invariant holds, then check the window size. Each event is touched twice (right enters, left leaves), so the sweep is amortized O(E)—vs O(E²) for a naive nested loop.
- Worked example: same events, threshold 3 — the first alert fires at `right = 2` (09:02).
```python
def alert_at(timestamps: list[float], window_sec: float, threshold: int) -> int | None:
    timestamps.sort()
    left = 0
    for right in range(len(timestamps)):
        while timestamps[right] - timestamps[left] > window_sec:
            left += 1
        if right - left + 1 >= threshold:
            return right  # index of triggering event
    return None
```
Per-badge alert tracking with defaultdict(list)
Real prompts have multiple entities. Build defaultdict(list) keyed by badge, with timestamps as the value; iterate badges and apply the per-entity sliding window. Per-badge state is independent, which makes the algorithm trivially parallelizable.
- Worked example: three badges, threshold 3 in 5 minutes.
| badge | timestamps | alerts? |
|---|---|---|
| B1 | `[09:00, 09:01:30, 09:02]` | yes (at 09:02) |
| B2 | `[09:00, 09:30]` | no |
| B3 | `[09:00, 09:00:30, 09:01, 09:01:30]` | yes (at 09:01) |
```python
from collections import defaultdict

def find_alerts(
    events: list[tuple[str, float]], window_sec: float, threshold: int
) -> list[str]:
    by_badge: dict[str, list[float]] = defaultdict(list)
    for badge, ts in events:
        by_badge[badge].append(ts)
    alerted: list[str] = []
    for badge, timestamps in by_badge.items():
        timestamps.sort()
        left = 0
        for right in range(len(timestamps)):
            while timestamps[right] - timestamps[left] > window_sec:
                left += 1
            if right - left + 1 >= threshold:
                alerted.append(badge)
                break
    return alerted
```
Common beginner mistakes
- Forgetting to sort timestamps before the sliding-window sweep — nonsense windows.
- Off-by-one on the window predicate (`<` vs `<=`) — flags or doesn't flag boundary events.
- Updating `left` with `if` instead of `while` — doesn't advance past stale events when multiple are out of window.
- Iterating the wrong axis (per-event instead of per-badge) and getting `O(E²)` cross-comparisons.
- Forgetting to `break` after the first alert per badge — reports the same badge multiple times.
Practice Walmart-style hash table problems →
Python interview question on time-window security alerts
Given a list of (badge_id, timestamp) events, a window in seconds W, and a threshold K, return the list of badge_ids that have at least K events within any sliding W-second window. Each badge appears at most once in the output. Use O(E log E) time.
Solution using defaultdict(list) plus per-badge two-pointer sweep
```python
from collections import defaultdict

def alerted_badges(
    events: list[tuple[str, float]], window_sec: float, threshold: int
) -> list[str]:
    by_badge: dict[str, list[float]] = defaultdict(list)
    for badge, ts in events:
        by_badge[badge].append(ts)
    alerted: list[str] = []
    for badge, timestamps in by_badge.items():
        timestamps.sort()
        left = 0
        for right in range(len(timestamps)):
            while timestamps[right] - timestamps[left] > window_sec:
                left += 1
            if right - left + 1 >= threshold:
                alerted.append(badge)
                break
    return sorted(alerted)
```
Why this works: Grouping events by badge with defaultdict(list) is O(E) and gives per-badge independent state. Sorting each badge's timestamps is O(g log g) per badge, summed to O(E log E) total. The two-pointer sweep is O(g) per badge, O(E) total. The break after the first alert ensures each badge appears at most once. Final sorted(alerted) produces deterministic output ordering. Total complexity O(E log E).
6. Python Hash Tables for Access Log Mismatch Detection
Access log mismatch detection in Python for data engineering
Paired events that must balance: each entry should match an exit; orphans are anomalies. Surfaces in security audits (badge-in / badge-out), transaction logs (BEGIN / COMMIT), payment systems (charge / capture), resource leaks (open / close). Two data-structure choices: Counter (increment on entry, decrement on exit) for a simple balance check, or per-entity stack (push / pop) when you must pair specific entries to specific exits.
Pro tip: §5 finds entities with too many events in a window; §6 finds entities with unmatched events. Different invariants, different structures—name the difference out loud.
Counter for entry/exit balance check
Maintain a Counter keyed by entity. Increment on entry, decrement on exit. After processing, any non-zero count is an anomaly: positive = unmatched entries, negative = unmatched exits. Order-independent, but cannot pair specific entries to specific exits.
- Worked example: events `[(B1, entry), (B2, entry), (B1, exit), (B3, exit)]`.
| entity | final count | meaning |
|---|---|---|
| B1 | 0 | balanced |
| B2 | +1 | orphan entry |
| B3 | -1 | orphan exit |
```python
from collections import Counter

def find_orphans(events: list[tuple[str, str]]) -> dict[str, int]:
    balance: Counter = Counter()
    for entity, kind in events:
        balance[entity] += 1 if kind == "entry" else -1
    return {e: c for e, c in balance.items() if c != 0}
```
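Running it on the worked example above reproduces the table:

```python
events = [("B1", "entry"), ("B2", "entry"), ("B1", "exit"), ("B3", "exit")]
print(find_orphans(events))  # {'B2': 1, 'B3': -1}
```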
Pairing entries to exits with a stack
When pairing matters (LIFO nested transactions, "the i-th entry matches the i-th exit"), use a per-entity stack in defaultdict(list). On entry push; on exit pop and emit a pair. Leftover stack entries are orphan entries; an exit with an empty stack is an orphan exit. For FIFO pairing, swap pop for deque.popleft.
- Worked example: LIFO pairing of `B1` events.
| event | stack after | pair emitted |
|---|---|---|
| `(B1, entry, 09:00)` | `[09:00]` | — |
| `(B1, entry, 09:30)` | `[09:00, 09:30]` | — |
| `(B1, exit, 10:00)` | `[09:00]` | `(09:30, 10:00)` |
| `(B1, exit, 10:30)` | `[]` | `(09:00, 10:30)` |
```python
from collections import defaultdict

def pair_events(events: list[tuple[str, str, float]]) -> tuple[list, list]:
    stacks: dict[str, list[float]] = defaultdict(list)
    pairs: list[tuple[str, float, float]] = []
    orphan_exits: list[tuple[str, float]] = []
    for entity, kind, ts in events:
        if kind == "entry":
            stacks[entity].append(ts)
        else:
            if stacks[entity]:
                entry_ts = stacks[entity].pop()
                pairs.append((entity, entry_ts, ts))
            else:
                orphan_exits.append((entity, ts))
    orphan_entries = [
        (e, ts) for e, st in stacks.items() for ts in st
    ]
    return pairs, orphan_entries + orphan_exits
```
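For the FIFO convention mentioned above ("the i-th entry matches the i-th exit"), a hedged sketch swaps the stack pop for `deque.popleft`; everything else is unchanged (`pair_events_fifo` is an illustrative name):

```python
from collections import defaultdict, deque

def pair_events_fifo(events: list[tuple[str, str, float]]) -> tuple[list, list]:
    queues: dict[str, deque] = defaultdict(deque)
    pairs: list[tuple[str, float, float]] = []
    orphan_exits: list[tuple[str, float]] = []
    for entity, kind, ts in events:
        if kind == "entry":
            queues[entity].append(ts)
        elif queues[entity]:
            # Oldest unmatched entry pairs with this exit (FIFO).
            pairs.append((entity, queues[entity].popleft(), ts))
        else:
            orphan_exits.append((entity, ts))
    orphan_entries = [(e, ts) for e, q in queues.items() for ts in q]
    return pairs, orphan_entries + orphan_exits
```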
Detecting orphan entries / orphan exits
Output is two lists with different operational meaning: orphan entries = lingering open transactions or employees still inside (active state); orphan exits = exits without an entry (data corruption). Real systems attach timestamps so the output is actionable—the stack approach preserves them naturally; the Counter approach loses them.
- Worked example: events `[(B1, entry), (B1, entry), (B1, exit), (B2, exit)]` → `B1` has 1 unmatched entry, `B2` has 1 unmatched exit.
```python
from collections import Counter

def categorize_orphans(events: list[tuple[str, str]]) -> tuple[dict, dict]:
    balance: Counter = Counter()
    for entity, kind in events:
        balance[entity] += 1 if kind == "entry" else -1
    orphan_entries = {e: c for e, c in balance.items() if c > 0}
    orphan_exits = {e: -c for e, c in balance.items() if c < 0}
    return orphan_entries, orphan_exits
```
Common beginner mistakes
- Using a single `Counter` and reporting any non-zero count as an "orphan" without distinguishing entry vs exit type.
- Using a stack when the prompt only needs balance — over-engineered, slower, more code.
- Using a queue (`deque.popleft`) when LIFO pairing is required — mismatched pairs.
- Forgetting that `Counter[k] += 1` works on a missing key (auto-zeros), but `Counter[k] -= 1` on a missing key creates a `-1` entry — not always wrong but worth knowing.
- Iterating events in random order — fine for the Counter approach (commutative), breaks the pairing approach (which depends on chronological order).
Python interview question on access log mismatch detection
Given a chronologically ordered list of (badge_id, kind) events where kind is 'entry' or 'exit', return two dictionaries: orphan entries (badges that entered without a matching exit) and orphan exits (badges that exited without a prior entry). Use O(E) time.
Solution using per-badge pending-entry counters
```python
from collections import defaultdict

def detect_mismatches(
    events: list[tuple[str, str]]
) -> tuple[dict[str, int], dict[str, int]]:
    pending: dict[str, int] = defaultdict(int)  # entries not yet matched by an exit
    orphan_exits: dict[str, int] = defaultdict(int)
    for badge, kind in events:
        if kind == "entry":
            pending[badge] += 1
        elif pending[badge] > 0:
            pending[badge] -= 1
        else:
            orphan_exits[badge] += 1
    orphan_entries = {b: c for b, c in pending.items() if c > 0}
    return orphan_entries, dict(orphan_exits)
```
Why this works: A per-badge counter (`pending[badge]`) tracks unmatched entries—since entries carry no payload here, a count is all the "stack" we need. On exit, decrement if there's a pending entry; otherwise the exit is an orphan and goes into `orphan_exits`. After processing all events, `pending[badge] > 0` means orphan entries (badges that entered more times than they exited). Total time O(E) for the single pass plus O(unique badges) for the final filter. The chronological-order assumption is critical—out-of-order events would mis-classify an exit as orphan when its matching entry hasn't arrived yet.
7. Python BFS on Grid / Matrix for Shortest Paths
BFS on grid for shortest paths in Python for data engineering
Given a 2D grid where some cells are walls, find the shortest path from a start to a target. Textbook breadth-first search (BFS): visit cells in order of increasing distance; the first time you reach the target, the distance is shortest. Works on unweighted graphs with uniform-cost moves; for weighted edges use Dijkstra. Total cost O(rows * cols) time and space.
Pro tip: Always reach for `collections.deque` (O(1) `popleft`); a `list` with `pop(0)` is O(n) per dequeue and dies on a million-cell grid.
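A hedged micro-benchmark sketch of that claim (absolute timings vary by machine; the ratio is what matters):

```python
import timeit

n = 100_000
list_secs = timeit.timeit("q.pop(0)", setup=f"q = list(range({n}))", number=n)
deque_secs = timeit.timeit(
    "q.popleft()",
    setup=f"from collections import deque; q = deque(range({n}))",
    number=n,
)
print(f"list.pop(0): {list_secs:.3f}s  deque.popleft(): {deque_secs:.3f}s")
```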
Grid as implicit graph: 4-neighbor adjacency
A 2D grid is naturally a graph: each cell is a node with up to four neighbors. Don't build an adjacency list—generate neighbors on demand with a precomputed offset list. Variations swap the offset list: 8-neighbor adds diagonals, knight-moves uses 8 L-shaped offsets, wraparound applies modulo arithmetic.
- Precompute `DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]` outside the loop.
- Bounds check: `0 <= nr < rows and 0 <= nc < cols`.
- Walkability check: `grid[nr][nc] != WALL`.
- Worked example: from `(2, 2)` in a 5×5 grid, the four neighbors are `(1, 2), (3, 2), (2, 1), (2, 3)`, all in bounds.
```python
DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def neighbors(r: int, c: int, rows: int, cols: int):
    for dr, dc in DIRS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield nr, nc
```
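The variations mentioned above only swap the offset list—a hedged sketch (`DIRS_8`, `KNIGHT`, and `neighbors_wrap` are illustrative names):

```python
DIRS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
KNIGHT = [(-2, -1), (-2, 1), (-1, -2), (-1, 2), (1, -2), (1, 2), (2, -1), (2, 1)]

def neighbors_wrap(r: int, c: int, rows: int, cols: int, dirs=DIRS_8):
    # Wraparound grid: modulo arithmetic replaces the bounds check.
    for dr, dc in dirs:
        yield (r + dr) % rows, (c + dc) % cols
```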
BFS with a deque for shortest-path-on-unweighted-graph
Skeleton: enqueue the start, pop the front, mark visited, enqueue unvisited neighbors. BFS visits cells in order of increasing distance, so the first time the target is reached, the path is shortest. DFS would not give this guarantee.
- `collections.deque` with `append` / `popleft` is O(1) at both ends.
- Storage — enqueue `(row, col, distance)` tuples or maintain a separate `dist[r][c]` table.
- Worked example: 3×3 open grid, start `(0, 0)`, target `(2, 2)` — first reached at distance 4.
```python
from collections import deque

DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def shortest_path(
    grid: list[list[int]], start: tuple[int, int], target: tuple[int, int]
) -> int:
    rows, cols = len(grid), len(grid[0])
    if start == target:
        return 0
    visited = {start}
    queue: deque = deque([(start[0], start[1], 0)])
    while queue:
        r, c, d = queue.popleft()
        for dr, dc in DIRS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != 1 and (nr, nc) not in visited:
                if (nr, nc) == target:
                    return d + 1
                visited.add((nr, nc))
                queue.append((nr, nc, d + 1))
    return -1  # unreachable
```
Visited set to prevent cycles
Without a visited set, BFS re-enqueues every cell from each of its neighbors—exponential blow-up. Mark cells visited at enqueue time, not at dequeue time; otherwise the same cell may be enqueued by multiple neighbors before any of them processes it.
- Memory — O(rows * cols) for the set; for very large grids prefer a 2D `bool` array (one byte per cell).
- Invariant — each cell is enqueued at most once; total work is O(rows * cols).
- Worked example: without visited, `(0, 0)` enqueues `(0, 1)` and `(1, 0)`, each of which re-enqueues `(0, 0)`, then `(1, 1)` from each — exponential. With visited, every cell is processed exactly once.
(See the BFS code above; `visited` is the set with the same name.)
Common beginner mistakes
- Using DFS instead of BFS for shortest path — DFS finds a path, not the shortest.
- Using a list with `pop(0)` as a queue — O(n) per dequeue, painfully slow on large grids.
- Marking cells visited at dequeue time instead of enqueue time — the same cell gets enqueued multiple times.
- Forgetting to return `-1` (or some sentinel) when the target is unreachable — the function falls through to `None` and downstream code breaks.
- Allocating a fresh `DIRS` tuple inside the inner loop instead of using a module-level constant.
See more BFS and graph problems →
Python interview question on grid BFS
Given a 2D grid where 0 is walkable and 1 is a wall, plus a start (r0, c0) and a target (rT, cT), return the minimum number of 4-neighbor steps from start to target, or -1 if no path exists. Use O(rows * cols) time.
Solution using deque BFS with a visited set
```python
from collections import deque

DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def min_path(
    grid: list[list[int]], start: tuple[int, int], target: tuple[int, int]
) -> int:
    if start == target:
        return 0
    rows, cols = len(grid), len(grid[0])
    visited = {start}
    queue: deque = deque([(start[0], start[1], 0)])
    while queue:
        r, c, d = queue.popleft()
        for dr, dc in DIRS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and (nr, nc) not in visited:
                if (nr, nc) == target:
                    return d + 1
                visited.add((nr, nc))
                queue.append((nr, nc, d + 1))
    return -1
```
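A small usage check (the wall row forces the path around the right edge):

```python
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(min_path(grid, (0, 0), (2, 0)))  # 6 — around the right edge
print(min_path(grid, (0, 0), (1, 0)))  # -1 — the target cell is a wall
```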
Why this works: BFS visits cells in order of increasing distance, so the first time the target is reached the distance is shortest. The deque gives O(1) enqueue/dequeue, keeping the per-step cost constant. The visited set ensures each cell is enqueued at most once, capping total work at O(rows * cols). The bounds and wall checks happen at enqueue time so we never queue invalid cells. Returning d + 1 on first reaching the target is correct because BFS guarantees no shorter path exists at this point.
8. Python Dynamic Programming with BFS for Treasure-Finding
Dynamic programming with BFS for multi-target pathfinding in Python for data engineering
Multi-target pathfinding extends §7's BFS to "shortest path that visits all treasures." Pure BFS gives shortest-to-one-target, not optimal multi-target tours. The fix: combine BFS (for inter-target distances) with bitmask DP (for visit order)—classic small-K Traveling Salesman. Total complexity O(K² * 2^K) for K treasures, tractable up to K ≈ 20.
Pro tip: State the algorithm family before coding. "This is TSP-shaped, use bitmask DP" earns credit even if the prompt only has 2–3 treasures and a simpler answer suffices.
When to use DP-on-grid vs pure BFS
Pure BFS handles one source, one target, uniform-cost moves. DP enters when the state grows beyond position alone, or when costs are non-uniform.
- One source, one target — BFS.
- Multi-target visit-all — bitmask DP with BFS distances, OR augmented-state BFS.
- Weighted edges — Dijkstra.
- Negative weights — Bellman-Ford.
- All-pairs shortest paths — Floyd-Warshall.
- Worked example: start `S` plus treasures `T1, T2`. Pure BFS finds the closer single target; bitmask DP picks `min(S→T1→T2, S→T2→T1)`.
```python
# pure BFS for one target
def bfs_to_target(grid, start, target):
    # ... (as in §7)
    pass

# bitmask DP for multi-target — see the anchor problem below
```
State expansion with bitmask DP for multi-target
State is (current_position, bitmask_of_collected_treasures). Each treasure has a unique bit; the mask is monotonic (once set, stays set), so the state space is bounded by rows * cols * 2^K. Edge weights are still uniform, so BFS over the augmented state space is correct for shortest paths—a "BFS on a product graph."
- State space — a 5×5 grid with K = 4 → `25 * 16 = 400` states; 100×100 with K = 10 → ~10M. Beyond K ≈ 20, switch to approximation.
- Visited key — `(row, col, mask)`. The same cell at different masks is a distinct state.
- Goal — the first dequeued state with `mask == full_mask`.
- Worked example: start `(0, 0)`, treasures at `(2, 2)` (bit 0) and `(4, 4)` (bit 1) — reachable states include `(0, 0, 0b00)`, `(2, 2, 0b01)`, `(4, 4, 0b11)`.
```python
from collections import deque

DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def shortest_collect_all(
    grid: list[list[int]], start: tuple[int, int], treasures: list[tuple[int, int]]
) -> int:
    rows, cols = len(grid), len(grid[0])
    treasure_bit = {pos: 1 << i for i, pos in enumerate(treasures)}
    full_mask = (1 << len(treasures)) - 1
    start_mask = treasure_bit.get(start, 0)
    queue: deque = deque([(start[0], start[1], start_mask, 0)])
    visited = {(start[0], start[1], start_mask)}
    while queue:
        r, c, mask, d = queue.popleft()
        if mask == full_mask:
            return d
        for dr, dc in DIRS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != 1:
                new_mask = mask | treasure_bit.get((nr, nc), 0)
                state = (nr, nc, new_mask)
                if state not in visited:
                    visited.add(state)
                    queue.append((nr, nc, new_mask, d + 1))
    return -1
```
Combining BFS for distance + DP for optimal path
Alternative: precompute pairwise BFS distances between all K + 1 points (start + treasures), then run a TSP-style DP over the distance matrix. Decouples grid traversal from optimization—cleaner when K << rows * cols.
- Phase 1 — (K + 1) BFS runs, total O((K + 1) * rows * cols).
- Phase 2 — TSP DP over `(visited_mask, current_index)`, total O(K² * 2^K).
- Trade-off — augmented-state BFS is one function but uses more memory; decoupled BFS+DP is faster on large grids with few treasures.
- Worked example: 5×5, two treasures → 3×3 distance matrix; DP picks `min(S→T1→T2, S→T2→T1)`.
```python
from collections import deque

DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def shortest_tour_via_bfs_dp(grid, start, treasures):
    points = [start] + treasures
    K = len(treasures)
    if K == 0:
        return 0
    rows, cols = len(grid), len(grid[0])
    index_of = {p: i for i, p in enumerate(points)}
    INF = float("inf")
    # Phase 1: pairwise BFS distances (the §7 BFS, run once from each point)
    dist = [[INF] * (K + 1) for _ in range(K + 1)]
    for i, src in enumerate(points):
        seen = {src}
        queue: deque = deque([(src[0], src[1], 0)])
        while queue:
            r, c, d = queue.popleft()
            if (r, c) in index_of:
                dist[i][index_of[(r, c)]] = d
            for dr, dc in DIRS:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != 1 and (nr, nc) not in seen:
                    seen.add((nr, nc))
                    queue.append((nr, nc, d + 1))
    # Phase 2: TSP DP over (visited_mask, current point index)
    dp = [[INF] * (K + 1) for _ in range(1 << K)]
    for i in range(K):
        dp[1 << i][i + 1] = dist[0][i + 1]  # start -> treasure i
    for mask in range(1, 1 << K):
        for i in range(K):
            if not (mask & (1 << i)):
                continue
            for j in range(K):
                if i == j or not (mask & (1 << j)):
                    continue
                prev_mask = mask ^ (1 << i)
                cand = dp[prev_mask][j + 1] + dist[j + 1][i + 1]
                if cand < dp[mask][i + 1]:
                    dp[mask][i + 1] = cand
    full_mask = (1 << K) - 1
    return min(dp[full_mask][i + 1] for i in range(K))
```
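A quick cross-check of the two approaches on a small open grid (assumes both functions above are in scope):

```python
open_grid = [[0] * 5 for _ in range(5)]
trail = [(2, 2), (4, 4)]
# S -> (2,2) is 4 moves, (2,2) -> (4,4) is 4 more: 8 total.
assert shortest_collect_all(open_grid, (0, 0), trail) == 8
assert shortest_tour_via_bfs_dp(open_grid, (0, 0), trail) == 8
```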
Common beginner mistakes
- Using pure BFS for multi-target problems — gives the one-target shortest path, not the multi-target optimum.
- Forgetting that the augmented state includes the bitmask — keying `visited` only by `(r, c)` collapses different masks into the same state and gives wrong answers.
- Using DP without BFS-precomputed distances — naive `dist` calculations can be O(rows * cols) per pair, multiplying the overall cost.
- Off-by-one on the bitmask construction — setting bit K for the K-th treasure (should be bit K - 1 with 0-indexing).
- Forgetting to include the start state's mask when the start happens to sit on a treasure cell.
Python interview question on DP + BFS for treasure-finding
Given a grid, a start cell, and a list of K treasure cells, return the minimum number of moves to collect all treasures (in any order). Walls are impassable. Each move is to a 4-neighbor cell. Use O(rows * cols * 2^K) time.
Solution using BFS over augmented state space (row, col, mask)
```python
from collections import deque

DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def min_moves_collect_all(
    grid: list[list[int]],
    start: tuple[int, int],
    treasures: list[tuple[int, int]],
) -> int:
    rows, cols = len(grid), len(grid[0])
    treasure_bit = {pos: 1 << i for i, pos in enumerate(treasures)}
    full_mask = (1 << len(treasures)) - 1
    if full_mask == 0:
        return 0
    start_mask = treasure_bit.get(start, 0)
    queue: deque = deque([(start[0], start[1], start_mask, 0)])
    visited = {(start[0], start[1], start_mask)}
    while queue:
        r, c, mask, d = queue.popleft()
        if mask == full_mask:
            return d
        for dr, dc in DIRS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != 1:
                new_mask = mask | treasure_bit.get((nr, nc), 0)
                state = (nr, nc, new_mask)
                if state not in visited:
                    visited.add(state)
                    queue.append((nr, nc, new_mask, d + 1))
    return -1
```
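Usage on a small grid with one wall (the optimal order is not the listed order):

```python
grid = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
# Best order visits (0, 2) first, then (2, 2): 2 + 2 = 4 moves.
print(min_moves_collect_all(grid, (0, 0), [(2, 2), (0, 2)]))  # 4
```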
Why this works: BFS over the augmented state space (row, col, mask) finds the shortest path in the implicit product graph (grid × bitmask). Since edge weights are uniform (every move costs 1), BFS is correct for shortest paths. The mask is monotonic—once a treasure bit is set, it stays set—so the state space is bounded by rows * cols * 2^K. The first time a state with mask == full_mask is dequeued, its distance is the shortest tour. Total time O(rows * cols * 2^K); memory O(rows * cols * 2^K) for the visited set.
Tips to crack Walmart data engineering interviews
These are habits that move the needle in real Walmart loops—not a re-statement of the topics above.
SQL preparation: drill string parsing, time-series rollups, anti-joins
Spend most of your SQL prep on string parsing (SPLIT_PART, LIKE, regex), time-series rollups (DATE_TRUNC + GROUP BY + JOIN), and anti-joins (NOT EXISTS and LEFT JOIN ... IS NULL). Type the queries; do not just read them. The aggregation, anti-join, and string-manipulation topic pages cover the bulk.
Python preparation: hash tables, BFS, DP basics
Walmart's Python rounds reward stdlib fluency over framework knowledge. Memorize: defaultdict(list) for grouping, Counter for balance checks, collections.deque for BFS, bitmask DP for multi-target optimization. Drill problems from the hash table, BFS, and dynamic programming topic pages.
Retail-domain framing
Walmart's prompts dress general algorithms in retail data: SKU codes, campaigns, stocked departments, badges, access logs, treasure paths. The interviewer is grading whether you map the framing to the algorithm correctly. State the mapping out loud: "this is sliding window over per-entity events"; "this is anti-join via NOT EXISTS"; "this is BFS on a grid"; "this is bitmask DP for multi-target." Mapping framings to families is the meta-skill.
Where to practice on PipeCode
| Skill lane | Practice path |
|---|---|
| Curated Walmart practice set | /explore/practice/company/walmart |
| SQL aggregation | /explore/practice/topic/aggregation |
| SQL anti-join | /explore/practice/topic/anti-join |
| SQL string manipulation | /explore/practice/topic/string-manipulation |
| Python hash table | /explore/practice/topic/hash-table |
| Python BFS | /explore/practice/topic/bfs |
| Dynamic programming | /explore/practice/topic/dynamic-programming |
| All practice topics | /explore/practice/topics |
| Interview courses | /explore/courses |
Communication under time pressure
State assumptions before typing: "I'll assume SKU format is fixed SKU-{CATEGORY}-{ID}"; "I'll assume timestamps are sorted within each badge"; "I'll assume the grid uses 4-neighbor adjacency, not 8." State invariants after key code blocks. State complexity: "this is O(E log E) for the per-badge sort." Interviewers grade clear reasoning above silent-and-perfect.
Frequently Asked Questions
What is the Walmart data engineering interview process like?
The Walmart data engineering interview typically includes a phone screen (Python or SQL warm-up), one or two coding rounds focused on SQL string parsing / aggregation and Python hash-table or BFS algorithms, a system-design conversation around pipelines and retail data architectures, and behavioral interviews. The curated 9-problem Walmart practice set on PipeCode mirrors what you will see on the technical rounds.
What SQL topics does Walmart test for data engineers?
Walmart emphasizes string parsing (SKU code extraction with SPLIT_PART, LIKE, regex), time-series rollups with DATE_TRUNC + GROUP BY + JOIN, and anti-joins for missing-relationship queries (NOT EXISTS, LEFT JOIN ... IS NULL). Drill these on the aggregation, anti-join, and string-manipulation topic pages.
How important is Python for a Walmart data engineering interview?
Python is roughly two-thirds of the technical interview at Walmart—and the patterns are pipeline-realistic, not algorithm puzzles. Expect hash tables for time-window security, access log mismatch detection, BFS on grids for distribution-center pathfinding, and DP-with-BFS for multi-target tours. Stdlib fluency (collections, functools, itertools) matters more than knowing the latest framework.
Are BFS and graph problems common in the Walmart data engineering interview?
Yes—two of the curated set's nine problems are grid BFS / DP-with-BFS (treasure pathfinding). Memorize the BFS skeleton with collections.deque and a visited set, plus the augmented-state BFS for multi-target problems. The BFS topic page has problems matching this shape.
How should I prepare for Walmart's retail-flavored prompts?
Build the mental model first: every retail-flavored prompt maps to a general algorithm. SKU codes = string parsing; campaign impressions = time-series rollup; stocked vs unstocked = anti-join; badge times = sliding window over per-entity events; access logs = balance check via Counter / stack; treasure paths = grid BFS or bitmask DP. State the mapping out loud during the interview—it shows the interviewer you've internalized the pattern recognition.
How many Walmart practice problems should I solve before the interview?
Aim for 30–50 problems spanning all eight topic clusters above—not 100 of the same kind. Solve every problem in the Walmart-tagged practice set, then back-fill weak areas using the topic pages linked throughout this guide.
Start practicing Walmart data engineering problems
Reading patterns is not the same as typing them under time pressure. PipeCode pairs company-tagged Walmart problems with tests, AI feedback, and a coding environment so you can drill the exact SQL and Python patterns Walmart asks—without the noise of generic algorithm prep that doesn't apply to this loop.
Pipecode.ai is Leetcode for Data Engineering.



Top comments (0)