Walmart data engineering interview questions sit at the intersection of SQL and Python, with framings that read like real retail data: SKU codes that need parsing, campaign impressions rolled up by time bucket, stocked-vs-unstocked department joins, badge-time security alerts, access logs that need entry/exit balance checks, and BFS shortest-path queries over distribution-center grids. Expect three SQL problems (string parsing, time-series rollup, anti-join) and six Python problems (math primitives, hash-table-driven security analytics, BFS pathfinding, DP-with-BFS hybrids).
This guide walks through the eight topic clusters Walmart actually tests, each with a detailed topic explanation, per-sub-topic walkthroughs with worked examples and solutions, and an interview-style problem whose full solution explains why it works. The mix matches the curated 9-problem Walmart set (2 easy, 7 medium, 0 hard)—medium-heavy without the algorithmic Hard tier of Databricks' set—so the article focuses on solid mid-difficulty fluency rather than exotic techniques.
Top Walmart data engineering interview topics
From the Walmart data engineering practice set, the eight numbered sections below follow this topic map (one row per H2):
| # | Topic (sections 1–8) | Why it shows up at Walmart |
|---|---|---|
| 1 | SQL string manipulation and pattern matching | Extract product category from SKU code—regex extraction on a fixed-format string. |
| 2 | SQL aggregation, joins, and date functions | Campaign impression time rollup—DATE_TRUNC + GROUP BY + join to a campaign dimension. |
| 3 | SQL anti-joins for missing-relationship queries | Unstocked store departments—rows in departments with no matching row in stock. |
| 4 | Python math and digit-by-digit primitives | Sum of digits—integer arithmetic primitives every interviewer expects fluency in. |
| 5 | Python hash tables for time-window security alerts | Badge time security alert—group events per badge, find rapid-succession swipes. |
| 6 | Python hash tables for access log mismatch detection | Access log mismatch detection—pair entry events to exit events, flag orphans. |
| 7 | Python BFS on grid / matrix for shortest paths | Minimum path to treasure—4-neighbor BFS on a 2D grid. |
| 8 | Python dynamic programming with BFS for treasure-finding | Find a treasure—DP-on-grid hybridized with BFS for multi-target optimization. |
Retail framing rule: Walmart prompts dress general algorithm patterns in retail data. SKU code = string with a structured prefix; campaign impressions = time-series events; stocked departments = join from a dimension to a fact; badge swipes = timestamped events grouped by entity; access logs = paired events that must balance; treasure path = BFS on a grid. Mapping the framing to the algorithm is half the win.
1. SQL String Manipulation and Pattern Matching for Retail Codes
SQL string parsing for retail product codes in data engineering
Retail systems are awash in structured strings: SKUs like SKU-CAT123-456, store codes like STORE-NW-042, promo IDs like PROMO-Q4-XMAS-2026. The interview shape is always the same: extract one field from the structured string and group / aggregate by it. Pick the lightest tool that still answers: position-based (SUBSTRING, LEFT, SPLIT_PART) for fixed formats, LIKE for wildcard filters, regex only when the format is irregular.
Pro tip: Leading-wildcard `LIKE '%CAT%'` and per-row regex defeat B-tree indexes. For high-volume tables, anchor patterns at the start (`LIKE 'SKU-CAT%'`) or generate an indexed extracted column.
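One way to materialize that extracted column is a generated column plus an index—a minimal PostgreSQL sketch (assumes PostgreSQL 12+ generated columns; the column and index names are illustrative):

```sql
-- Hypothetical names; assumes PostgreSQL 12+ generated columns.
ALTER TABLE products
  ADD COLUMN category_code TEXT
  GENERATED ALWAYS AS (SPLIT_PART(sku, '-', 2)) STORED;

-- Equality and prefix filters on category_code can now use this index.
CREATE INDEX idx_products_category_code ON products (category_code);
```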
SUBSTRING, LEFT, RIGHT, POSITION
Four functions cover position-based extraction. Position-based is fast but brittle: change the SKU format and the query silently extracts the wrong field.
- `LEFT(s, n)` — first `n` characters; `RIGHT(s, n)` — last `n`.
- `SUBSTRING(s FROM start FOR length)` — general 1-indexed slice.
- `POSITION(needle IN haystack)` — 1-indexed offset of the first occurrence (0 if absent).
- `SPLIT_PART(s, delim, n)` (PostgreSQL) — the n-th delimiter-separated field; the cleanest shortcut for SKU-style strings.
- Worked example: extract the category from `SKU-CAT123-456`.
| function | call | result |
|---|---|---|
| `LEFT` | `LEFT('SKU-CAT123-456', 3)` | `'SKU'` |
| `SUBSTRING` | `SUBSTRING('SKU-CAT123-456' FROM 5 FOR 6)` | `'CAT123'` |
| `SPLIT_PART` | `SPLIT_PART('SKU-CAT123-456', '-', 2)` | `'CAT123'` |
```sql
SELECT
  sku,
  SPLIT_PART(sku, '-', 2) AS category_code
FROM products;
```
Pattern matching with LIKE and SIMILAR TO
LIKE uses two wildcards (% for any-length, _ for one char) and is case-sensitive in PostgreSQL (ILIKE for case-insensitive). SIMILAR TO is a SQL-standard middle ground with regex-style alternation and quantifiers.
- Index-friendly: `LIKE 'SKU-CAT%'` (leading anchor) — uses a B-tree.
- Index-defeating: `LIKE '%CAT%'` (leading wildcard) — sequential scan.
- `SIMILAR TO 'SKU-(CAT|DOG)%'` — alternation across categories (`SIMILAR TO` must match the whole string, hence the trailing `%`).
- Worked example: `WHERE sku LIKE 'SKU-CAT%'` matches every CAT-category SKU using the index on `sku`.
```sql
SELECT sku, name
FROM products
WHERE sku LIKE 'SKU-CAT%';
```
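For the alternation case, a hedged sketch of the equivalent `SIMILAR TO` filter on the same `products` table (remember `SIMILAR TO` anchors to the whole string):

```sql
-- Matches SKUs whose category is CAT or DOG; the trailing %
-- is required because SIMILAR TO must consume the entire string.
SELECT sku, name
FROM products
WHERE sku SIMILAR TO 'SKU-(CAT|DOG)%';
```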
Regex extraction with REGEXP_REPLACE / REGEXP_MATCHES
When the format is irregular (optional prefixes, variable-length fields, mixed delimiters), regex is the right tool—at the cost of speed. Use it when the format demands it, not as a default.
- `REGEXP_REPLACE(s, pattern, replacement, flags)` — substitute matches.
- `REGEXP_MATCHES(s, pattern, flags)` — returns a `text[]` array of capture groups.
- `SUBSTRING(s FROM pattern)` — first match of the pattern (or its first capture group if the pattern uses parentheses).
- Worked example: `SUBSTRING('SKU-CAT123-NEW' FROM '^SKU-([A-Z]+)\d+')` returns `'CAT'`; the same pattern on `'SKU-DOG456'` returns `'DOG'`.
```sql
SELECT
  sku,
  SUBSTRING(sku FROM '^SKU-([A-Z]+)\d+') AS category_code,
  COUNT(*) AS sku_count
FROM products
GROUP BY category_code;
```
Common beginner mistakes
- Using `LIKE '%CAT%'` and getting a sequential scan on a billion-row table — missing the index opportunity.
- Forgetting `LIKE` is case-sensitive in PostgreSQL — use `ILIKE` for case-insensitive matching.
- Writing `SPLIT_PART(s, '-', 0)` thinking it gives the first field — `SPLIT_PART` is 1-indexed, so the first field is index 1.
- Hardcoding fixed-position substrings (`SUBSTRING(sku FROM 5 FOR 3)`) and breaking when the SKU format changes.
- Using `REGEXP_MATCHES` and forgetting it returns a `text[]` array — you may need to dereference with `[1]`.
SQL interview question on string parsing and pattern matching
Table products(sku TEXT, name TEXT) contains SKUs in the format SKU-{CATEGORY}-{NUMERIC_ID} where CATEGORY is uppercase letters and NUMERIC_ID is digits. Return per-category counts of products, sorted by count descending then category alphabetically.
Solution using SPLIT_PART and GROUP BY
```sql
SELECT
  SPLIT_PART(sku, '-', 2) AS category,
  COUNT(*) AS product_count
FROM products
WHERE sku LIKE 'SKU-%-%'
GROUP BY category
ORDER BY product_count DESC, category ASC;
```
Why this works: `SPLIT_PART(sku, '-', 2)` returns the 2nd dash-separated field—exactly the category portion of the SKU. The `WHERE sku LIKE 'SKU-%-%'` predicate filters to well-formed rows with at least two dashes (defending against malformed data without regex). `GROUP BY category` collapses to one row per category; `COUNT(*)` is the per-category multiplicity. `ORDER BY product_count DESC, category ASC` gives deterministic output that matches typical "top categories" tie-break rules.
2. SQL Aggregation, Joins, and Date Functions for Time-Series Rollups
SQL time-series rollups in data engineering
Time-series rollups are the most common SQL pattern in retail data engineering: per-campaign hourly impressions, per-store daily revenue. The pattern is DATE_TRUNC(unit, ts) to bucket the timestamp, GROUP BY the truncated value plus other dimensions, aggregate the metric. Walmart's rollup question layers a join to a dimension on top—classic star-schema query.
Pro tip: State the timezone convention before writing `WHERE`. `DATE_TRUNC` operates in the session timezone; UTC vs `America/Chicago` shifts day boundaries silently.
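A hedged sketch of pinning the boundary explicitly (assumes `ts` is `TIMESTAMPTZ`; `AT TIME ZONE` converts it to a local wall-clock timestamp before truncation):

```sql
-- Bucket by Chicago-local days rather than session-timezone days.
SELECT
  DATE_TRUNC('day', ts AT TIME ZONE 'America/Chicago') AS chicago_day,
  COUNT(*) AS events
FROM events
GROUP BY chicago_day
ORDER BY chicago_day;
```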
DATE_TRUNC to bucket by hour / day / week
DATE_TRUNC(unit, ts) truncates a timestamp to the start of the named unit ('hour', 'day', 'week', 'month', …). The result is still a timestamp, just zeroed below the unit boundary—every row in the bucket maps to the same left-edge timestamp.
- Week convention — `DATE_TRUNC('week', ts)` returns Monday (ISO). For Sunday-start (US retail), compute the offset manually (see the sketch after the hourly example below).
- Index trap — `DATE_TRUNC(...)` in `WHERE` defeats the timestamp index. Prefer half-open ranges (`ts >= '2026-04-28' AND ts < '2026-04-29'`); `DATE_TRUNC` in `GROUP BY` is fine.
- Worked example: truncate to hour.
| input | `DATE_TRUNC('hour', ts)` |
|---|---|
| `'2026-04-28 09:15:30'` | `'2026-04-28 09:00:00'` |
| `'2026-04-28 09:42:00'` | `'2026-04-28 09:00:00'` |
| `'2026-04-28 10:03:15'` | `'2026-04-28 10:00:00'` |
```sql
SELECT
  DATE_TRUNC('hour', ts) AS hour,
  COUNT(*) AS events
FROM events
GROUP BY hour
ORDER BY hour;
```
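For the Sunday-start convention mentioned above, a common trick (hedged sketch, same `events` table) shifts the timestamp forward a day, truncates to the ISO Monday, then shifts back:

```sql
-- Sunday-start weeks: shift +1 day, truncate to Monday, shift -1 day.
SELECT
  DATE_TRUNC('week', ts + INTERVAL '1 day') - INTERVAL '1 day' AS week_start_sunday,
  COUNT(*) AS events
FROM events
GROUP BY week_start_sunday
ORDER BY week_start_sunday;
```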
GROUP BY on a truncated date alongside other keys
Real rollups have multiple dimensions. GROUP BY accepts as many keys as you need; the result has one row per unique combination. PostgreSQL lets you reference the SELECT alias (GROUP BY hour, campaign_id)—cleaner than re-evaluating the DATE_TRUNC expression.
- Worked example: two campaigns over two hours.
| campaign | hour | impressions |
|---|---|---|
| C1 | 09:00 | 2 |
| C1 | 10:00 | 1 |
| C2 | 09:00 | 1 |
| C2 | 10:00 | 1 |
```sql
SELECT
  campaign_id,
  DATE_TRUNC('hour', ts) AS hour,
  COUNT(*) AS impressions
FROM impressions
GROUP BY campaign_id, hour
ORDER BY campaign_id, hour;
```
Joining a fact table to a campaign dimension
Real-world rollups need attributes from a dimension (campaign name, advertiser, region). Join fact-to-dimension before the GROUP BY; aggregating-then-joining usually produces the same plan but more code.
- `INNER JOIN` if every fact row must have a matching dimension (typical).
- `LEFT JOIN` if orphan facts should appear with NULL dimension attributes — state the assumption.
- Worked example: the impressions table joins to a tiny campaign dimension, then groups by `(campaign_id, hour)`.
```sql
SELECT
  c.campaign_id,
  c.name,
  DATE_TRUNC('hour', i.ts) AS hour,
  COUNT(*) AS impressions
FROM impressions i
JOIN campaigns c ON c.campaign_id = i.campaign_id
GROUP BY c.campaign_id, c.name, hour
ORDER BY c.campaign_id, hour;
```
Common beginner mistakes
- Filtering on `DATE_TRUNC(...)` in `WHERE` and missing the index — use half-open ranges for predicates.
- Forgetting timezone — rolling up timestamps in UTC while displaying them in local time shifts the boundaries silently.
- Using `BETWEEN '2026-03-01' AND '2026-03-31'` for "all of March" — on a timestamp column the upper bound means midnight, so everything after `2026-03-31 00:00:00` is missed.
- Aggregating before joining when joining-then-aggregating produces the same result with less code.
- Forgetting to include the dimension's descriptive columns (e.g. `name`) in `GROUP BY` — causes a "must appear in the GROUP BY clause" error.
Drill SQL aggregation patterns →
SQL interview question on time-series rollups
Tables impressions(impression_id, campaign_id, ts) and campaigns(campaign_id, name, type). Return per-campaign hourly impression counts for the day 2026-04-28, including the campaign name. Sort by campaign_id then hour.
Solution using DATE_TRUNC, JOIN, and GROUP BY
```sql
SELECT
  c.campaign_id,
  c.name,
  DATE_TRUNC('hour', i.ts) AS hour,
  COUNT(*) AS impressions
FROM impressions i
JOIN campaigns c ON c.campaign_id = i.campaign_id
WHERE i.ts >= '2026-04-28'
  AND i.ts < '2026-04-29'
GROUP BY c.campaign_id, c.name, hour
ORDER BY c.campaign_id, hour;
```
Why this works: The half-open WHERE predicate (>= '2026-04-28' AND < '2026-04-29') is index-friendly—the planner can use a B-tree index on ts and skip rows outside the day in O(log n). The JOIN attaches the campaign name; INNER JOIN is correct because every impression must have a campaign. DATE_TRUNC('hour', i.ts) buckets rows into hourly slots; GROUP BY collapses to one row per (campaign, hour). COUNT(*) is the impression metric. Total cost: index-range scan on impressions (fast), hash-join with campaigns (small dimension), hash-aggregate on the result.
3. SQL Anti-Joins for Missing-Relationship Queries
Anti-joins for missing-relationship queries in SQL for data engineering
The "rows in A with no matching row in B" pattern is the anti-join: products never purchased, customers without orders, machines without recent heartbeats, unstocked store departments. Three SQL formulations express the same set: LEFT JOIN ... WHERE rhs IS NULL, NOT EXISTS (correlated), and NOT IN (with the NULL trap). Default to NOT EXISTS for readability and NULL-safety.
Pro tip: The planner usually rewrites `NOT EXISTS` and `LEFT JOIN ... IS NULL` to the same hash anti-join (O(N + M)). `NOT IN` short-circuits to zero rows when the subquery contains a NULL—a silent correctness bug.
LEFT JOIN ... WHERE rhs IS NULL
Outer-join the left side to the right, then filter to rows where the right side is NULL—those are the unmatched left rows. Pick a non-nullable right-side column (typically the right table's primary key) for the IS NULL check; otherwise legitimate NULL right-side rows leak through.
- Worked example: `departments = [(s1, d1), (s1, d2), (s2, d1)]`; `stock = [(s1, d1), (s2, d1)]`. Unstocked = `(s1, d2)`.
```sql
SELECT d.store_id, d.dept_id
FROM departments d
LEFT JOIN stock s
  ON s.store_id = d.store_id AND s.dept_id = d.dept_id
WHERE s.store_id IS NULL;
```
NOT EXISTS (correlated)
NOT EXISTS (subquery) is true when the subquery yields zero rows for the outer row. It reads naturally, short-circuits on the first match, and is NULL-safe by construction. Multi-column join predicates compose cleanly inside the subquery's WHERE.
- Worked example: same `departments` / `stock`; the query below returns `(s1, d2)` only.
```sql
SELECT d.store_id, d.dept_id
FROM departments d
WHERE NOT EXISTS (
  SELECT 1
  FROM stock s
  WHERE s.store_id = d.store_id AND s.dept_id = d.dept_id
);
```
NULL trap of NOT IN
SQL's three-valued logic means `x NOT IN (a, b, NULL)` can never evaluate to TRUE: for any `x` not in the list, `x != NULL` is UNKNOWN, so the whole predicate is UNKNOWN. The `WHERE` clause then drops every row—silent zero-row result. If you must use `NOT IN`, add `IS NOT NULL` inside the subquery; better, use `NOT EXISTS`.
- Worked example: the buggy and defensive shapes side by side.
```sql
-- Returns 0 rows if any product_id in purchases is NULL!
SELECT product_id FROM products
WHERE product_id NOT IN (
  SELECT product_id FROM purchases  -- contains NULLs
);

-- Defensive variant
SELECT product_id FROM products
WHERE product_id NOT IN (
  SELECT product_id FROM purchases
  WHERE product_id IS NOT NULL
);
```
Common beginner mistakes
- Reaching for `NOT IN` on a subquery that returns NULLs — silent zero-row result.
- Using `INNER JOIN` for "missing rows" — logically wrong; it returns the matched rows instead of the unmatched ones.
- Picking a nullable right-side column for the `IS NULL` filter — accidentally includes legitimate NULL right-side rows.
- Writing the join condition in `WHERE` instead of `ON` for a `LEFT JOIN` — turns the outer join into an inner join.
- Forgetting that anti-joins return rows from the left side only — right-side columns are always NULL after the filter and shouldn't be projected.
SQL interview question on anti-joins
Table `departments(store_id, dept_id)` lists every (store, department) pair that should exist; table `stock(store_id, dept_id, qty)` records current inventory. Return all `(store_id, dept_id)` pairs that have no inventory row at all (unstocked). Sort by `store_id` then `dept_id`.
Solution using NOT EXISTS
```sql
SELECT d.store_id, d.dept_id
FROM departments d
WHERE NOT EXISTS (
  SELECT 1 FROM stock s
  WHERE s.store_id = d.store_id AND s.dept_id = d.dept_id
)
ORDER BY d.store_id, d.dept_id;
```
Why this works: NOT EXISTS is the declarative, NULL-safe shape for "rows in departments with no matching row in stock." The correlated subquery is evaluated per outer row but short-circuits on the first matching stock row, so the per-row cost is O(1) average (with an index on (store_id, dept_id) in stock). Multi-column join predicates compose naturally inside the subquery's WHERE. Total cost: hash anti-join, O(N + M) where N is departments and M is stock. Output is just the left-side columns, sorted deterministically.
4. Python Math and Digit-by-Digit Primitives
Python math primitives in data engineering
Easy DE questions like "sum of digits" test whether you can write clean code under pressure—right idiom, correct edge cases, complexity awareness. Three idioms compete: modular arithmetic (% 10, // 10, O(1) memory), string conversion (sum(int(c) for c in str(n)), Pythonic), and recursive digital root (O(log* n) iterations, with a famous O(1) closed form). Pick by what the prompt asks—and mention the alternative.
Pro tip: State the negative-input contract. With Python's floor division, a `while n != 0` modular loop never reaches zero for `n < 0` (and a `while n > 0` guard silently returns 0) unless you `abs()` first; the string form raises on the leading `'-'`.
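A minimal demonstration of the floor-division behavior behind that warning:

```python
# Python floor division rounds toward negative infinity, so a
# `while n != 0` digit loop would cycle on -1 forever.
assert (-7 % 10, -7 // 10) == (3, -1)
assert (-1 % 10, -1 // 10) == (9, -1)  # -1 // 10 stays -1 on every iteration
```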
Integer digit extraction with % 10 and // 10
n % 10 peels the last digit; n // 10 drops it. Loop until n reaches zero. After iteration k, the running total holds the sum of the k least-significant digits. Loop runs O(log₁₀ n) times for non-negative n; abs-first to handle negatives.
- Worked example: trace `n = 1234`.
| iteration | n before | last digit (`n % 10`) | total | n after (`n // 10`) |
|---|---|---|---|---|
| 1 | 1234 | 4 | 4 | 123 |
| 2 | 123 | 3 | 7 | 12 |
| 3 | 12 | 2 | 9 | 1 |
| 4 | 1 | 1 | 10 | 0 |
```python
def digit_sum(n: int) -> int:
    n = abs(n)
    total = 0
    while n > 0:
        total += n % 10
        n //= 10
    return total
```
String-conversion alternative (sum(int(c) for c in str(n)))
The Pythonic one-liner: sum(int(c) for c in str(abs(n))). The abs(n) strips a leading minus sign that would otherwise trip int('-'). Memory is O(log n) for the intermediate string vs O(1) for the modular form—usually irrelevant.
- Worked example: `n = 1234` yields the digits `'1', '2', '3', '4'` → `1 + 2 + 3 + 4 = 10`.
```python
def digit_sum(n: int) -> int:
    return sum(int(c) for c in str(abs(n)))
```
Recursive digit sum (digital root)
The digital root is the iterative digit-sum applied until a single digit remains. Three implementations: recursive, iterative, and the O(1) closed form 1 + (n - 1) % 9 (for n > 0), which falls out of the fact that n mod 9 is preserved by the digit-sum operation.
- Recursive — clean; the depth grows so slowly (each level replaces n with its digit sum) that Python's recursion limit is never a practical concern.
- Iterative — `while n >= 10: n = digit_sum(n)`. Same complexity, no recursion.
- Closed form — `1 + (n - 1) % 9` for `n > 0`; special-case `n = 0`.
- Worked example: `digital_root(999)`: `999 → 27 → 9`. Closed-form check: `1 + 998 % 9 = 1 + 8 = 9`. ✓
```python
def digital_root_iter(n: int) -> int:
    n = abs(n)
    while n >= 10:
        n = sum(int(c) for c in str(n))
    return n

def digital_root_closed(n: int) -> int:
    n = abs(n)
    if n == 0:
        return 0
    return 1 + (n - 1) % 9
```
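A quick property check (assuming the two functions above are in scope) confirming the closed form agrees with the iterative version, including the `n = 0` edge case:

```python
# Exhaustive agreement check over a small range.
for n in range(100_000):
    assert digital_root_closed(n) == digital_root_iter(n)
```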
Common beginner mistakes
- Forgetting to `abs(n)` on negative inputs — the modular loop either returns 0 (`while n > 0`) or never terminates (`while n != 0`), and `int('-')` raises with the string approach.
- Using `n / 10` (true division, returns a float) instead of `n // 10` (floor division) — floats break the digit decomposition.
- Recursing without a base case for `n < 10` — infinite recursion.
- Hard-coding `digit_sum` as `n % 10 + n // 10` for two-digit numbers — works for 10–99, breaks for 100+.
- Assuming the closed-form digital-root identity `1 + (n - 1) % 9` holds for `n = 0` — it gives `9` instead of `0`. Special-case `n = 0`.
Python interview question on math primitives
Given a non-negative integer n, return the digital root (the iterative sum-of-digits until a single digit remains). Solve in O(1) time using the closed-form identity, and provide an iterative version as a fallback.
Solution using the closed-form identity with iterative fallback
```python
def digital_root(n: int) -> int:
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    return 1 + (n - 1) % 9

def digital_root_iter(n: int) -> int:
    if n < 0:
        raise ValueError("n must be non-negative")
    while n >= 10:
        n = sum(int(c) for c in str(n))
    return n
```
Why this works: the closed-form identity comes from the fact that a number and its digit sum are congruent modulo 9 (their difference is always divisible by 9), so digital_root(n) ≡ n (mod 9) for n > 0. The shift `1 + (n - 1) % 9` maps n to the range [1, 9] instead of [0, 8], matching the digital-root convention. The iterative fallback is provided for clarity and as a sanity check; both produce the same answer for n > 0. The n = 0 case is special-cased because the closed form would map 0 to 9, which is incorrect.
5. Python Hash Tables for Time-Window Security Alerts
Hash tables for time-window security analytics in Python for data engineering
Per-entity timestamped events with a time-window predicate—"more than K events in T seconds"—is a textbook DE pattern: rate limiting, fraud detection, badge security alerts. Algorithm: group by entity with defaultdict(list), sort by timestamp within each group, two-pointer sweep to maintain a sliding window. The shape generalizes; recognize it.
Pro tip: Use `<= window_seconds` (not `<`) when the prompt says "within T seconds" inclusive of the boundary. An off-by-one here flips whether boundary events flag.
Sort events by timestamp, scan with two pointers
When events arrive in arbitrary order, sort first within each group—the sliding window only makes sense over chronologically ordered events. Sort is O(E log E) total; the two-pointer sweep is O(E) (each pointer advances at most E times). Net O(E log E).
- Worked example: `B1` events `[09:00, 09:01:30, 09:02, 09:30]`, 5-minute window.
| right | event | window | count |
|---|---|---|---|
| 0 | 09:00 | `[09:00]` | 1 |
| 1 | 09:01:30 | `[09:00, 09:01:30]` | 2 |
| 2 | 09:02 | `[09:00, 09:01:30, 09:02]` | 3 |
| 3 | 09:30 | `[09:30]` | 1 |
```python
def has_alert(timestamps: list[float], window_sec: float, threshold: int) -> bool:
    timestamps.sort()
    left = 0
    for right in range(len(timestamps)):
        while timestamps[right] - timestamps[left] > window_sec:
            left += 1
        if right - left + 1 >= threshold:
            return True
    return False
```
Sliding-window count of accesses within T seconds
The window invariant: timestamps[right] - timestamps[left] <= window_seconds. On each new right, advance left until the invariant holds, then check the window size. Each event is touched twice (right enters, left leaves), so the sweep is amortized O(E)—vs O(E²) for a naive nested loop.
- Worked example: same events, threshold 3 — the first alert fires at `right = 2` (09:02).
```python
def alert_at(timestamps: list[float], window_sec: float, threshold: int) -> int | None:
    timestamps.sort()
    left = 0
    for right in range(len(timestamps)):
        while timestamps[right] - timestamps[left] > window_sec:
            left += 1
        if right - left + 1 >= threshold:
            return right  # index of triggering event
    return None
```
Per-badge alert tracking with defaultdict(list)
Real prompts have multiple entities. Build defaultdict(list) keyed by badge, with timestamps as the value; iterate badges and apply the per-entity sliding window. Per-badge state is independent, which makes the algorithm trivially parallelizable.
- Worked example: three badges, threshold 3 in 5 minutes.
| badge | timestamps | alerts? |
|---|---|---|
| B1 | `[09:00, 09:01:30, 09:02]` | yes (at 09:02) |
| B2 | `[09:00, 09:30]` | no |
| B3 | `[09:00, 09:00:30, 09:01, 09:01:30]` | yes (at 09:01) |
```python
from collections import defaultdict

def find_alerts(
    events: list[tuple[str, float]], window_sec: float, threshold: int
) -> list[str]:
    by_badge: dict[str, list[float]] = defaultdict(list)
    for badge, ts in events:
        by_badge[badge].append(ts)
    alerted: list[str] = []
    for badge, timestamps in by_badge.items():
        timestamps.sort()
        left = 0
        for right in range(len(timestamps)):
            while timestamps[right] - timestamps[left] > window_sec:
                left += 1
            if right - left + 1 >= threshold:
                alerted.append(badge)
                break
    return alerted
```
Common beginner mistakes
- Forgetting to sort timestamps before the sliding-window sweep — nonsense windows.
- Off-by-one on the window predicate (`<` vs `<=`) — flags or doesn't flag boundary events.
- Updating `left` with `if` instead of `while` — doesn't advance past stale events when multiple are out of window.
- Iterating the wrong axis (per-event instead of per-badge) and getting `O(E²)` cross-comparisons.
- Forgetting to `break` after the first alert per badge — reports the same badge multiple times.
Practice Walmart-style hash table problems →
Python interview question on time-window security alerts
Given a list of (badge_id, timestamp) events, a window in seconds W, and a threshold K, return the list of badge_ids that have at least K events within any sliding W-second window. Each badge appears at most once in the output. Use O(E log E) time.
Solution using defaultdict(list) plus per-badge two-pointer sweep
```python
from collections import defaultdict

def alerted_badges(
    events: list[tuple[str, float]], window_sec: float, threshold: int
) -> list[str]:
    by_badge: dict[str, list[float]] = defaultdict(list)
    for badge, ts in events:
        by_badge[badge].append(ts)
    alerted: list[str] = []
    for badge, timestamps in by_badge.items():
        timestamps.sort()
        left = 0
        for right in range(len(timestamps)):
            while timestamps[right] - timestamps[left] > window_sec:
                left += 1
            if right - left + 1 >= threshold:
                alerted.append(badge)
                break
    return sorted(alerted)
```
Why this works: Grouping events by badge with defaultdict(list) is O(E) and gives per-badge independent state. Sorting each badge's timestamps is O(g log g) per badge, summed to O(E log E) total. The two-pointer sweep is O(g) per badge, O(E) total. The break after the first alert ensures each badge appears at most once. Final sorted(alerted) produces deterministic output ordering. Total complexity O(E log E).
6. Python Hash Tables for Access Log Mismatch Detection
Access log mismatch detection in Python for data engineering
Paired events that must balance: each entry should match an exit; orphans are anomalies. Surfaces in security audits (badge-in / badge-out), transaction logs (BEGIN / COMMIT), payment systems (charge / capture), resource leaks (open / close). Two data-structure choices: Counter (increment on entry, decrement on exit) for a simple balance check, or per-entity stack (push / pop) when you must pair specific entries to specific exits.
Pro tip: §5 finds entities with too many events in a window; §6 finds entities with unmatched events. Different invariants, different structures—name the difference out loud.
Counter for entry/exit balance check
Maintain a Counter keyed by entity. Increment on entry, decrement on exit. After processing, any non-zero count is an anomaly: positive = unmatched entries, negative = unmatched exits. Order-independent, but cannot pair specific entries to specific exits.
- Worked example: events `[(B1, entry), (B2, entry), (B1, exit), (B3, exit)]`.
| entity | final count | meaning |
|---|---|---|
| B1 | 0 | balanced |
| B2 | +1 | orphan entry |
| B3 | -1 | orphan exit |
```python
from collections import Counter

def find_orphans(events: list[tuple[str, str]]) -> dict[str, int]:
    balance: Counter = Counter()
    for entity, kind in events:
        balance[entity] += 1 if kind == "entry" else -1
    return {e: c for e, c in balance.items() if c != 0}
```
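Running it on the worked example above reproduces the table:

```python
events = [("B1", "entry"), ("B2", "entry"), ("B1", "exit"), ("B3", "exit")]
print(find_orphans(events))  # {'B2': 1, 'B3': -1}
```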
Pairing entries to exits with a stack
When pairing matters (LIFO nested transactions, "the i-th entry matches the i-th exit"), use a per-entity stack in defaultdict(list). On entry push; on exit pop and emit a pair. Leftover stack entries are orphan entries; an exit with an empty stack is an orphan exit. For FIFO pairing, swap pop for deque.popleft.
- Worked example: LIFO pairing of `B1` events.
| event | stack after | pair emitted |
|---|---|---|
| `(B1, entry, 09:00)` | `[09:00]` | — |
| `(B1, entry, 09:30)` | `[09:00, 09:30]` | — |
| `(B1, exit, 10:00)` | `[09:00]` | `(09:30, 10:00)` |
| `(B1, exit, 10:30)` | `[]` | `(09:00, 10:30)` |
```python
from collections import defaultdict

def pair_events(events: list[tuple[str, str, float]]) -> tuple[list, list]:
    stacks: dict[str, list[float]] = defaultdict(list)
    pairs: list[tuple[str, float, float]] = []
    orphan_exits: list[tuple[str, float]] = []
    for entity, kind, ts in events:
        if kind == "entry":
            stacks[entity].append(ts)
        else:
            if stacks[entity]:
                entry_ts = stacks[entity].pop()
                pairs.append((entity, entry_ts, ts))
            else:
                orphan_exits.append((entity, ts))
    orphan_entries = [
        (e, ts) for e, st in stacks.items() for ts in st
    ]
    return pairs, orphan_entries + orphan_exits
```
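For the FIFO convention mentioned above ("the i-th entry matches the i-th exit"), a hedged sketch swaps the stack pop for `deque.popleft`; everything else is unchanged (`pair_events_fifo` is an illustrative name):

```python
from collections import defaultdict, deque

def pair_events_fifo(events: list[tuple[str, str, float]]) -> tuple[list, list]:
    queues: dict[str, deque] = defaultdict(deque)
    pairs: list[tuple[str, float, float]] = []
    orphan_exits: list[tuple[str, float]] = []
    for entity, kind, ts in events:
        if kind == "entry":
            queues[entity].append(ts)
        elif queues[entity]:
            # Oldest unmatched entry pairs with this exit (FIFO).
            pairs.append((entity, queues[entity].popleft(), ts))
        else:
            orphan_exits.append((entity, ts))
    orphan_entries = [(e, ts) for e, q in queues.items() for ts in q]
    return pairs, orphan_entries + orphan_exits
```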
Detecting orphan entries / orphan exits
Output is two lists with different operational meaning: orphan entries = lingering open transactions or employees still inside (active state); orphan exits = exits without an entry (data corruption). Real systems attach timestamps so the output is actionable—the stack approach preserves them naturally; the Counter approach loses them.
- Worked example: events `[(B1, entry), (B1, entry), (B1, exit), (B2, exit)]` → `B1` has 1 unmatched entry, `B2` has 1 unmatched exit.
```python
from collections import Counter

def categorize_orphans(events: list[tuple[str, str]]) -> tuple[dict, dict]:
    balance: Counter = Counter()
    for entity, kind in events:
        balance[entity] += 1 if kind == "entry" else -1
    orphan_entries = {e: c for e, c in balance.items() if c > 0}
    orphan_exits = {e: -c for e, c in balance.items() if c < 0}
    return orphan_entries, orphan_exits
```
Common beginner mistakes
- Using a single `Counter` and reporting any non-zero count as an "orphan" without distinguishing entry vs exit type.
- Using a stack when the prompt only needs balance — over-engineered, slower, more code.
- Using a queue (`deque.popleft`) when LIFO pairing is required — mismatched pairs.
- Forgetting that `Counter[k] += 1` works on a missing key (auto-zeros), but `Counter[k] -= 1` on a missing key creates a `-1` entry — not always wrong but worth knowing.
- Iterating events in random order — fine for the Counter approach (commutative), breaks the pairing approach (which depends on chronological order).
Python interview question on access log mismatch detection
Given a chronologically ordered list of (badge_id, kind) events where kind is 'entry' or 'exit', return two dictionaries: orphan entries (badges that entered without a matching exit) and orphan exits (badges that exited without a prior entry). Use O(E) time.
Solution using per-badge pending-entry counters
```python
from collections import defaultdict

def detect_mismatches(
    events: list[tuple[str, str]]
) -> tuple[dict[str, int], dict[str, int]]:
    pending: dict[str, int] = defaultdict(int)  # entries not yet matched by an exit
    orphan_exits: dict[str, int] = defaultdict(int)
    for badge, kind in events:
        if kind == "entry":
            pending[badge] += 1
        elif pending[badge] > 0:
            pending[badge] -= 1
        else:
            orphan_exits[badge] += 1
    orphan_entries = {b: c for b, c in pending.items() if c > 0}
    return orphan_entries, dict(orphan_exits)
```
Why this works: A per-badge counter (`pending[badge]`) tracks unmatched entries—since entries carry no payload here, a count is all the "stack" we need. On exit, decrement if there's a pending entry; otherwise the exit is an orphan and goes into `orphan_exits`. After processing all events, `pending[badge] > 0` means orphan entries (badges that entered more times than they exited). Total time O(E) for the single pass plus O(unique badges) for the final filter. The chronological-order assumption is critical—out-of-order events would mis-classify an exit as orphan when its matching entry hasn't arrived yet.
7. Python BFS on Grid / Matrix for Shortest Paths
BFS on grid for shortest paths in Python for data engineering
Given a 2D grid where some cells are walls, find the shortest path from a start to a target. Textbook breadth-first search (BFS): visit cells in order of increasing distance; the first time you reach the target, the distance is shortest. Works on unweighted graphs with uniform-cost moves; for weighted edges use Dijkstra. Total cost O(rows * cols) time and space.
Pro tip: Always reach for `collections.deque` (O(1) `popleft`); a `list` with `pop(0)` is O(n) per dequeue and dies on a million-cell grid.
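A hedged micro-benchmark sketch of that claim (absolute timings vary by machine; the ratio is what matters):

```python
import timeit

n = 100_000
list_secs = timeit.timeit("q.pop(0)", setup=f"q = list(range({n}))", number=n)
deque_secs = timeit.timeit(
    "q.popleft()",
    setup=f"from collections import deque; q = deque(range({n}))",
    number=n,
)
print(f"list.pop(0): {list_secs:.3f}s  deque.popleft(): {deque_secs:.3f}s")
```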
Grid as implicit graph: 4-neighbor adjacency
A 2D grid is naturally a graph: each cell is a node with up to four neighbors. Don't build an adjacency list—generate neighbors on demand with a precomputed offset list. Variations swap the offset list: 8-neighbor adds diagonals, knight-moves uses 8 L-shaped offsets, wraparound applies modulo arithmetic.
- Precompute `DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]` outside the loop.
- Bounds check: `0 <= nr < rows and 0 <= nc < cols`.
- Walkability check: `grid[nr][nc] != WALL`.
- Worked example: from `(2, 2)` in a 5×5 grid, the four neighbors are `(1, 2), (3, 2), (2, 1), (2, 3)`, all in bounds.
```python
DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def neighbors(r: int, c: int, rows: int, cols: int):
    for dr, dc in DIRS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield nr, nc
```
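The variations mentioned above only swap the offset list—a hedged sketch (`DIRS_8`, `KNIGHT`, and `neighbors_wrap` are illustrative names):

```python
DIRS_8 = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
KNIGHT = [(-2, -1), (-2, 1), (-1, -2), (-1, 2), (1, -2), (1, 2), (2, -1), (2, 1)]

def neighbors_wrap(r: int, c: int, rows: int, cols: int, dirs=DIRS_8):
    # Wraparound grid: modulo arithmetic replaces the bounds check.
    for dr, dc in dirs:
        yield (r + dr) % rows, (c + dc) % cols
```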
BFS with a deque for shortest-path-on-unweighted-graph
Skeleton: enqueue the start, pop the front, mark visited, enqueue unvisited neighbors. BFS visits cells in order of increasing distance, so the first time the target is reached, the path is shortest. DFS would not give this guarantee.
- `collections.deque` with `append` / `popleft` is O(1) at both ends.
- Storage — enqueue `(row, col, distance)` tuples or maintain a separate `dist[r][c]` table.
- Worked example: 3×3 open grid, start `(0, 0)`, target `(2, 2)` — first reached at distance 4.
```python
from collections import deque

DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def shortest_path(
    grid: list[list[int]], start: tuple[int, int], target: tuple[int, int]
) -> int:
    rows, cols = len(grid), len(grid[0])
    if start == target:
        return 0
    visited = {start}
    queue: deque = deque([(start[0], start[1], 0)])
    while queue:
        r, c, d = queue.popleft()
        for dr, dc in DIRS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != 1 and (nr, nc) not in visited:
                if (nr, nc) == target:
                    return d + 1
                visited.add((nr, nc))
                queue.append((nr, nc, d + 1))
    return -1  # unreachable
```
Visited set to prevent cycles
Without a visited set, BFS re-enqueues every cell from each of its neighbors—exponential blow-up. Mark cells visited at enqueue time, not at dequeue time; otherwise the same cell may be enqueued by multiple neighbors before any of them processes it.
- Memory — O(rows * cols) for the set; for very large grids prefer a 2D `bool` array (one byte per cell).
- Invariant — each cell is enqueued at most once; total work is O(rows * cols).
- Worked example: without visited, `(0, 0)` enqueues `(0, 1)` and `(1, 0)`, each of which re-enqueues `(0, 0)`, then `(1, 1)` from each — exponential. With visited, every cell is processed exactly once.
(See the BFS code above; `visited` is the set with the same name.)
Common beginner mistakes
- Using DFS instead of BFS for shortest path — DFS finds a path, not the shortest.
- Using a list with `pop(0)` as a queue — O(n) per dequeue, painfully slow on large grids.
- Marking cells visited at dequeue time instead of enqueue time — the same cell gets enqueued multiple times.
- Forgetting to return `-1` (or some sentinel) when the target is unreachable — the function falls through to `None` and downstream code breaks.
- Allocating a fresh `DIRS` tuple inside the inner loop instead of using a module-level constant.
See more BFS and graph problems →
Python interview question on grid BFS
Given a 2D grid where 0 is walkable and 1 is a wall, plus a start (r0, c0) and a target (rT, cT), return the minimum number of 4-neighbor steps from start to target, or -1 if no path exists. Use O(rows * cols) time.
Solution using deque BFS with a visited set
```python
from collections import deque

DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def min_path(
    grid: list[list[int]], start: tuple[int, int], target: tuple[int, int]
) -> int:
    if start == target:
        return 0
    rows, cols = len(grid), len(grid[0])
    visited = {start}
    queue: deque = deque([(start[0], start[1], 0)])
    while queue:
        r, c, d = queue.popleft()
        for dr, dc in DIRS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and (nr, nc) not in visited:
                if (nr, nc) == target:
                    return d + 1
                visited.add((nr, nc))
                queue.append((nr, nc, d + 1))
    return -1
```
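A small usage check (the wall row forces the path around the right edge):

```python
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
print(min_path(grid, (0, 0), (2, 0)))  # 6 — around the right edge
print(min_path(grid, (0, 0), (1, 0)))  # -1 — the target cell is a wall
```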
Why this works: BFS visits cells in order of increasing distance, so the first time the target is reached the distance is shortest. The deque gives O(1) enqueue/dequeue, keeping the per-step cost constant. The visited set ensures each cell is enqueued at most once, capping total work at O(rows * cols). The bounds and wall checks happen at enqueue time so we never queue invalid cells. Returning d + 1 on first reaching the target is correct because BFS guarantees no shorter path exists at this point.
8. Python Dynamic Programming with BFS for Treasure-Finding
Dynamic programming with BFS for multi-target pathfinding in Python for data engineering
Multi-target pathfinding extends §7's BFS to "shortest path that visits all treasures." Pure BFS gives shortest-to-one-target, not optimal multi-target tours. The fix: combine BFS (for inter-target distances) with bitmask DP (for visit order)—classic small-K Traveling Salesman. Total complexity O(K² * 2^K) for K treasures, tractable up to K ≈ 20.
Pro tip: State the algorithm family before coding. "This is TSP-shaped, use bitmask DP" earns credit even if the prompt only has 2–3 treasures and a simpler answer suffices.
When to use DP-on-grid vs pure BFS
Pure BFS handles one source, one target, uniform-cost moves. DP enters when the state grows beyond position alone, or when costs are non-uniform.
- One source, one target — BFS.
- Multi-target visit-all — bitmask DP with BFS distances, OR augmented-state BFS.
- Weighted edges — Dijkstra.
- Negative weights — Bellman-Ford.
- All-pairs shortest paths — Floyd-Warshall.
- Worked example: start `S` plus treasures `T1, T2`. Pure BFS finds the closer single target; bitmask DP picks `min(S→T1→T2, S→T2→T1)`.
```python
# pure BFS for one target
def bfs_to_target(grid, start, target):
    # ... (as in §7)
    pass

# bitmask DP for multi-target — see the anchor problem below
```
State expansion with bitmask DP for multi-target
State is (current_position, bitmask_of_collected_treasures). Each treasure has a unique bit; the mask is monotonic (once set, stays set), so the state space is bounded by rows * cols * 2^K. Edge weights are still uniform, so BFS over the augmented state space is correct for shortest paths—a "BFS on a product graph."
- State space — a 5×5 grid with K = 4 → `25 * 16 = 400` states; 100×100 with K = 10 → ~10M. Beyond K ≈ 20, switch to approximation.
- Visited key — `(row, col, mask)`. The same cell at different masks is a distinct state.
- Goal — the first dequeued state with `mask == full_mask`.
- Worked example: start `(0, 0)`, treasures at `(2, 2)` (bit 0) and `(4, 4)` (bit 1) — reachable states include `(0, 0, 0b00)`, `(2, 2, 0b01)`, `(4, 4, 0b11)`.
```python
from collections import deque

DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def shortest_collect_all(
    grid: list[list[int]], start: tuple[int, int], treasures: list[tuple[int, int]]
) -> int:
    rows, cols = len(grid), len(grid[0])
    treasure_bit = {pos: 1 << i for i, pos in enumerate(treasures)}
    full_mask = (1 << len(treasures)) - 1
    start_mask = treasure_bit.get(start, 0)
    queue: deque = deque([(start[0], start[1], start_mask, 0)])
    visited = {(start[0], start[1], start_mask)}
    while queue:
        r, c, mask, d = queue.popleft()
        if mask == full_mask:
            return d
        for dr, dc in DIRS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != 1:
                new_mask = mask | treasure_bit.get((nr, nc), 0)
                state = (nr, nc, new_mask)
                if state not in visited:
                    visited.add(state)
                    queue.append((nr, nc, new_mask, d + 1))
    return -1
```
Combining BFS for distance + DP for optimal path
Alternative: precompute pairwise BFS distances between all K + 1 points (start + treasures), then run a TSP-style DP over the distance matrix. Decouples grid traversal from optimization—cleaner when K << rows * cols.
- Phase 1 — (K + 1) BFS runs, total O((K + 1) * rows * cols).
- Phase 2 — TSP DP over `(visited_mask, current_index)`, total O(K² * 2^K).
- Trade-off — augmented-state BFS is one function but uses more memory; decoupled BFS+DP is faster on large grids with few treasures.
- Worked example: 5×5, two treasures → 3×3 distance matrix; DP picks `min(S→T1→T2, S→T2→T1)`.
```python
from collections import deque

DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def shortest_tour_via_bfs_dp(grid, start, treasures):
    points = [start] + treasures
    K = len(treasures)
    if K == 0:
        return 0
    rows, cols = len(grid), len(grid[0])
    index_of = {p: i for i, p in enumerate(points)}
    INF = float("inf")
    # Phase 1: pairwise BFS distances (the §7 BFS, run once from each point)
    dist = [[INF] * (K + 1) for _ in range(K + 1)]
    for i, src in enumerate(points):
        seen = {src}
        queue: deque = deque([(src[0], src[1], 0)])
        while queue:
            r, c, d = queue.popleft()
            if (r, c) in index_of:
                dist[i][index_of[(r, c)]] = d
            for dr, dc in DIRS:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != 1 and (nr, nc) not in seen:
                    seen.add((nr, nc))
                    queue.append((nr, nc, d + 1))
    # Phase 2: TSP DP over (visited_mask, current point index)
    dp = [[INF] * (K + 1) for _ in range(1 << K)]
    for i in range(K):
        dp[1 << i][i + 1] = dist[0][i + 1]  # start -> treasure i
    for mask in range(1, 1 << K):
        for i in range(K):
            if not (mask & (1 << i)):
                continue
            for j in range(K):
                if i == j or not (mask & (1 << j)):
                    continue
                prev_mask = mask ^ (1 << i)
                cand = dp[prev_mask][j + 1] + dist[j + 1][i + 1]
                if cand < dp[mask][i + 1]:
                    dp[mask][i + 1] = cand
    full_mask = (1 << K) - 1
    return min(dp[full_mask][i + 1] for i in range(K))
```
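A quick cross-check of the two approaches on a small open grid (assumes both functions above are in scope):

```python
open_grid = [[0] * 5 for _ in range(5)]
trail = [(2, 2), (4, 4)]
# S -> (2,2) is 4 moves, (2,2) -> (4,4) is 4 more: 8 total.
assert shortest_collect_all(open_grid, (0, 0), trail) == 8
assert shortest_tour_via_bfs_dp(open_grid, (0, 0), trail) == 8
```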
Common beginner mistakes
- Using pure BFS for multi-target problems — gives the one-target shortest path, not the multi-target optimum.
- Forgetting that the augmented state includes the bitmask — keying `visited` only by `(r, c)` collapses different masks into the same state and gives wrong answers.
- Using DP without BFS-precomputed distances — naive `dist` calculations can be O(rows * cols) per pair, multiplying the overall cost.
- Off-by-one on the bitmask construction — setting bit K for the K-th treasure (should be bit K - 1 with 0-indexing).
- Forgetting to include the start state's mask when the start happens to sit on a treasure cell.
Python interview question on DP + BFS for treasure-finding
Given a grid, a start cell, and a list of K treasure cells, return the minimum number of moves to collect all treasures (in any order). Walls are impassable. Each move is to a 4-neighbor cell. Use O(rows * cols * 2^K) time.
Solution using BFS over augmented state space (row, col, mask)
```python
from collections import deque

DIRS = [(-1, 0), (1, 0), (0, -1), (0, 1)]

def min_moves_collect_all(
    grid: list[list[int]],
    start: tuple[int, int],
    treasures: list[tuple[int, int]],
) -> int:
    rows, cols = len(grid), len(grid[0])
    treasure_bit = {pos: 1 << i for i, pos in enumerate(treasures)}
    full_mask = (1 << len(treasures)) - 1
    if full_mask == 0:
        return 0
    start_mask = treasure_bit.get(start, 0)
    queue: deque = deque([(start[0], start[1], start_mask, 0)])
    visited = {(start[0], start[1], start_mask)}
    while queue:
        r, c, mask, d = queue.popleft()
        if mask == full_mask:
            return d
        for dr, dc in DIRS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != 1:
                new_mask = mask | treasure_bit.get((nr, nc), 0)
                state = (nr, nc, new_mask)
                if state not in visited:
                    visited.add(state)
                    queue.append((nr, nc, new_mask, d + 1))
    return -1
```
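Usage on a small grid with one wall (the optimal order is not the listed order):

```python
grid = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 0, 0],
]
# Best order visits (0, 2) first, then (2, 2): 2 + 2 = 4 moves.
print(min_moves_collect_all(grid, (0, 0), [(2, 2), (0, 2)]))  # 4
```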
Why this works: BFS over the augmented state space (row, col, mask) finds the shortest path in the implicit product graph (grid × bitmask). Since edge weights are uniform (every move costs 1), BFS is correct for shortest paths. The mask is monotonic—once a treasure bit is set, it stays set—so the state space is bounded by rows * cols * 2^K. The first time a state with mask == full_mask is dequeued, its distance is the shortest tour. Total time O(rows * cols * 2^K); memory O(rows * cols * 2^K) for the visited set.
Tips to crack Walmart data engineering interviews
These are habits that move the needle in real Walmart loops—not a re-statement of the topics above.
SQL preparation: drill string parsing, time-series rollups, anti-joins
Spend most of your SQL prep on string parsing (SPLIT_PART, LIKE, regex), time-series rollups (DATE_TRUNC + GROUP BY + JOIN), and anti-joins (NOT EXISTS and LEFT JOIN ... IS NULL). Type the queries; do not just read them. The aggregation, anti-join, and string-manipulation topic pages cover the bulk.
Python preparation: hash tables, BFS, DP basics
Walmart's Python rounds reward stdlib fluency over framework knowledge. Memorize: defaultdict(list) for grouping, Counter for balance checks, collections.deque for BFS, bitmask DP for multi-target optimization. Drill problems from the hash table, BFS, and dynamic programming topic pages.
Retail-domain framing
Walmart's prompts dress general algorithms in retail data: SKU codes, campaigns, stocked departments, badges, access logs, treasure paths. The interviewer is grading whether you map the framing to the algorithm correctly. State the mapping out loud: "this is sliding window over per-entity events"; "this is anti-join via NOT EXISTS"; "this is BFS on a grid"; "this is bitmask DP for multi-target." Mapping framings to families is the meta-skill.
Where to practice on PipeCode
| Skill lane | Practice path |
|---|---|
| Curated Walmart practice set | /explore/practice/company/walmart |
| SQL aggregation | /explore/practice/topic/aggregation |
| SQL anti-join | /explore/practice/topic/anti-join |
| SQL string manipulation | /explore/practice/topic/string-manipulation |
| Python hash table | /explore/practice/topic/hash-table |
| Python BFS | /explore/practice/topic/bfs |
| Dynamic programming | /explore/practice/topic/dynamic-programming |
| All practice topics | /explore/practice/topics |
| Interview courses | /explore/courses |
Communication under time pressure
State assumptions before typing: "I'll assume SKU format is fixed SKU-{CATEGORY}-{ID}"; "I'll assume timestamps are sorted within each badge"; "I'll assume the grid uses 4-neighbor adjacency, not 8." State invariants after key code blocks. State complexity: "this is O(E log E) for the per-badge sort." Interviewers grade clear reasoning above silent-and-perfect.
Frequently Asked Questions
What is the Walmart data engineering interview process like?
The Walmart data engineering interview typically includes a phone screen (Python or SQL warm-up), one or two coding rounds focused on SQL string parsing / aggregation and Python hash-table or BFS algorithms, a system-design conversation around pipelines and retail data architectures, and behavioral interviews. The curated 9-problem Walmart practice set on PipeCode mirrors what you will see on the technical rounds.
What SQL topics does Walmart test for data engineers?
Walmart emphasizes string parsing (SKU code extraction with SPLIT_PART, LIKE, regex), time-series rollups with DATE_TRUNC + GROUP BY + JOIN, and anti-joins for missing-relationship queries (NOT EXISTS, LEFT JOIN ... IS NULL). Drill these on the aggregation, anti-join, and string-manipulation topic pages.
How important is Python for a Walmart data engineering interview?
Python is roughly two-thirds of the technical interview at Walmart—and the patterns are pipeline-realistic, not algorithm puzzles. Expect hash tables for time-window security, access log mismatch detection, BFS on grids for distribution-center pathfinding, and DP-with-BFS for multi-target tours. Stdlib fluency (collections, functools, itertools) matters more than knowing the latest framework.
Are BFS and graph problems common in the Walmart data engineering interview?
Yes—two of the curated set's nine problems are grid BFS / DP-with-BFS (treasure pathfinding). Memorize the BFS skeleton with collections.deque and a visited set, plus the augmented-state BFS for multi-target problems. The BFS topic page has problems matching this shape.
How should I prepare for Walmart's retail-flavored prompts?
Build the mental model first: every retail-flavored prompt maps to a general algorithm. SKU codes = string parsing; campaign impressions = time-series rollup; stocked vs unstocked = anti-join; badge times = sliding window over per-entity events; access logs = balance check via Counter / stack; treasure paths = grid BFS or bitmask DP. State the mapping out loud during the interview—it shows the interviewer you've internalized the pattern recognition.
How many Walmart practice problems should I solve before the interview?
Aim for 30–50 problems spanning all eight topic clusters above—not 100 of the same kind. Solve every problem in the Walmart-tagged practice set, then back-fill weak areas using the topic pages linked throughout this guide.
Start practicing Walmart data engineering problems
Reading patterns is not the same as typing them under time pressure. PipeCode pairs company-tagged Walmart problems with tests, AI feedback, and a coding environment so you can drill the exact SQL and Python patterns Walmart asks—without the noise of generic algorithm prep that doesn't apply to this loop.
Pipecode.ai is Leetcode for Data Engineering.



Top comments (0)