data engineering skills in 2026 are not the random "Python + SQL + a cloud" laundry list every job board prints. Hiring managers at FAANG and streaming-heavy shops have converged on a tight, testable shape: ten technical skills the loop probes with code on a whiteboard, and five soft skills the behavioural round measures with STAR questions. Miss any of the ten on a phone screen and the loop ends; miss any of the five in the loop and the offer stalls at the bar-raiser.
This guide unpacks the exact ten data engineer skills that gate every modern DE loop — SQL, Python, dimensional modelling, Spark, Kafka, Airflow, cloud, warehouses, dbt, and system design — and the five data engineer soft skills that gate promotion past mid-level — stakeholder management, on-call discipline, communication, mentoring, and prioritisation. Each section pairs a 30-second self-test you can run in your kitchen with a code snippet or a STAR script and a concept-by-concept breakdown of why interviewers reward that exact answer shape. By the end you'll have a clean skills for data engineer checklist, a self-rating grid, and a study plan that treats data engineering technical skills and behavioural muscles as one ordered ladder, not two separate worlds.
When you want hands-on reps immediately after reading, drill the SQL window-function library →, warm up on Python data-structure problems →, and rehearse end-to-end ETL drills →.
On this page
- Why "DE skills" means 10 technical + 5 soft
- The DE skills pyramid — foundational → advanced → senior
- Tier 1 technical — SQL, Python, dimensional modelling
- Modern-stack technical — Spark, Kafka, Airflow, cloud, warehouses, dbt, system design
- Soft skills — stakeholders, on-call, communication, mentoring, prioritisation
- Cheat sheet — DE skills self-assessment grid
- Frequently asked questions
- Practice on PipeCode
1. Why "DE skills" means 10 technical + 5 soft
The 2026 job description has converged on a 15-skill shape — 10 technical + 5 behavioural
Scrape 200 senior data engineering job descriptions from FAANG, payments, and streaming shops in 2026 and the same 15 skills appear in 80% of them. Ten are technical bullets with concrete tools attached ("must have SQL + Python + Spark at production scale, dbt + Airflow + one warehouse"). Five are behavioural bullets phrased as outcomes ("partner with product on metric definitions", "own the on-call rotation for the data platform"). The hiring loop maps cleanly onto the two groups — coding rounds test the technical ten, behavioural rounds test the soft five, system-design tests the seam between them.
The job-description average — what 80% of 2026 DE listings ask for.
- 12 technical bullets typically — the 10 we cover here, plus two team-specific extras (CDC tools, ML pipelines, real-time analytics, embeddings, depending on the team).
- 4 soft-skill bullets typically — usually framed as "partnering with product", "writing design docs", "mentoring juniors", "owning incidents."
- One "nice to have" list that often includes Scala, Rust, Go, Flink, Iceberg, Delta Lake — these signal the team's leading edge, not the bar.
The T-shape model — go deep on 2–3 things, stay broad on the rest.
- The vertical bar of the T is depth — two or three skills you can debug a production incident with at 3am. Most senior DEs go deep on one of
{SQL + warehouses}, one of{Spark + Python}, and one of{Kafka + streaming}. - The horizontal bar is breadth — you know the shape of every other tool well enough to read a config, ask the right question, and not slow the team down.
- Juniors fall into the "knows everything" trap — three weeks of every tool, zero hours of incident debugging on any of them. Recruiters can smell this on a resume.
Skills vs experience — skills get the interview, soft skills get the offer.
- Skills decide whether your resume passes the recruiter screen and whether your phone screen passes. They are necessary but not sufficient.
- Soft skills decide whether the bar-raiser writes "hire" or "no hire" on the debrief. Two candidates with identical technicals — the one with cleaner stakeholder stories gets the offer.
- The seam is system design. It tests both at once — can you reason about trade-offs (technical) and defend a choice in front of a sceptical peer (soft).
What the hiring loop is actually measuring.
- Phone screen. SQL fluency, basic Python, one design probe. Tier-1 technical only.
- On-site coding. Spark / dbt / Airflow knobs, Kafka semantics, dimensional modelling. Tier-2 technical.
- System design. One end-to-end pipeline; trade-offs, failure modes, cost. Bridge.
- Behavioural. STAR stories on stakeholder mgmt, on-call, communication, mentoring, prioritisation. Soft 5.
- Bar raiser. Cross-checks all of the above; weight on the behaviourals at senior+.
Worked example — self-test: list your top 3 skills in 60 seconds
Detailed explanation. A working DE should be able to name their top three skills, top three gaps, and one behavioural story without warming up. If you can't, your skills inventory is fuzzy and you'll be fuzzy in the loop too.
Question. In 60 seconds, name (a) your three deepest technical skills with one concrete production project for each, (b) your three weakest technical skills you would flag to a hiring manager, and (c) one stakeholder-management story in STAR shape. Time yourself.
Code (a notebook prompt you can run on yourself).
60-second skill inventory — write the answer; don't think.
(a) Top 3 technical skills + a 1-sentence production proof:
1. _______________________ — proof: __________________________
2. _______________________ — proof: __________________________
3. _______________________ — proof: __________________________
(b) 3 weakest technical skills (would honestly flag in an interview):
1. _______________________
2. _______________________
3. _______________________
(c) Stakeholder STAR (30 seconds spoken):
Situation: _______________________
Task: _______________________
Action: _______________________
Result: _______________________
Step-by-step explanation.
- Top three reveal where you actually go deep. If you can't produce a concrete project per skill, the skill is on your resume but not in your hands. Recruiters will catch this in 10 minutes.
- Weakest three reveal self-awareness. The candidate who says "honestly, Spark performance tuning is my weakest — I lean on the senior engineer for plan inspection" is rated higher than the candidate who claims 9/10 on everything.
- STAR story reveals communication. A clean 30-second STAR ("we needed an attribution metric → I wrote the spec with marketing → shipped a v0 in 5 days → reduced their reporting lag from 24h to 4h") signals senior. A 5-minute rambling story signals junior.
- Why 60 seconds? Phone screens compress this exact question into the first 90 seconds. Practising under pressure beats practising at leisure.
Output.
| Question | Senior answer shape | Junior answer shape |
|---|---|---|
| Top 3 skills + proof | tool + production project per skill | tool only, no proof |
| Weakest 3 | named honestly with a coping strategy | "I'm good at everything" |
| STAR story | 30s, 4 beats, clean result number | 3 minutes, no result |
Rule of thumb. If you can't name your top 3 + bottom 3 + one STAR in 60 seconds, you're not ready for a senior phone screen. Practise this drill weekly until it's automatic.
DE-skills interview question on prioritising what to study next
A common framing on the coffee chat before the on-site is: "Walk me through how you'd decide what to study next given a four-week prep window." It tests whether the candidate has a model for de skills triage, not just a wishlist.
Solution Using the gap × frequency × difficulty matrix
# study_priority.py — score each skill, sort, study the top 3
SKILLS = {
# skill: (self_rating_1_5, interview_frequency_1_5, recovery_difficulty_1_5)
"SQL window functions": (4, 5, 2),
"Python pandas": (5, 5, 1),
"Dimensional modelling": (3, 4, 3),
"Spark performance tuning": (2, 4, 5),
"Kafka EOS": (2, 3, 4),
"Airflow operators": (4, 4, 2),
"Snowflake clustering": (3, 3, 3),
"dbt tests + macros": (4, 4, 2),
"System design": (3, 5, 5),
"Stakeholder STAR stories": (3, 5, 3),
}
def score(rating, freq, diff):
# higher = study sooner; weight frequency and difficulty over self-rating
return (6 - rating) * freq * diff
ranked = sorted(SKILLS.items(), key=lambda kv: -score(*kv[1]))
for skill, (r, f, d) in ranked[:5]:
print(f"{skill:<32s} priority={score(r, f, d)} rating={r}/5 freq={f}/5 diff={d}/5")
Step-by-step trace.
| Skill | rating | freq | diff | (6-rating)*freq*diff |
|---|---|---|---|---|
| Spark performance tuning | 2 | 4 | 5 | 80 |
| System design | 3 | 5 | 5 | 75 |
| Kafka EOS | 2 | 3 | 4 | 48 |
| Stakeholder STAR stories | 3 | 5 | 3 | 45 |
| Dimensional modelling | 3 | 4 | 3 | 36 |
Output:
| Rank | Skill | Why it leads |
|---|---|---|
| 1 | Spark performance tuning | low rating × high freq × hardest to recover in interview |
| 2 | System design | medium rating × highest freq × hardest to learn from scratch |
| 3 | Kafka EOS | low rating × medium freq × hard to fake |
| 4 | Stakeholder STAR stories | medium rating × highest freq × medium difficulty |
| 5 | Dimensional modelling | medium-low rating × high freq × medium difficulty |
Why this works — concept by concept:
-
Self-rating inversion —
6 - ratingturns "low rating = high need." A 5/5 contributes 1; a 1/5 contributes 5. - Frequency weighting — multiplying by interview frequency keeps you from studying things that won't be asked.
-
Difficulty weighting — Spark perf tuning takes weeks of effort; SQL
WHEREtakes hours. The score correctly biases toward the multi-week skills. - Top-3 discipline — only the top 3 enter the four-week plan. A 15-skill plan is no plan.
- Cost — the scoring function is O(N) per skill; building the inventory is the work, not the math.
SQL
Topic — window functions
Window-function drills (SQL)
2. The DE skills pyramid — foundational → advanced → senior
Every senior DE skill assumes the layer below — skip a layer, fail an interview
The mental model in one line: DE skills stack as a pyramid where Layer 0 (SQL + Python + Git + Linux) holds up everything; Layer 1 adds the modern stack (one warehouse + dbt + Airflow + cloud basics); Layer 2 adds distributed compute and streaming (Spark + Kafka + system design); Layer 3 adds platform thinking (on-call ownership + mentoring + stakeholder management). Each layer assumes the one below. A candidate who can recite Spark internals but flounders on a recursive CTE is failing Layer 0 — the loop ends.
Layer 0 — foundational (Months 0–6 of a DE career).
- SQL. Joins, GROUP BY, CTEs, window functions, basic query plans. Tested in 100% of DE loops.
- Python. Data structures, iterators, comprehensions, file I/O, basic OOP. Tested in 80% of DE loops.
- Git. Branch, rebase, merge, conflict resolve. Not tested in coding but expected on day one.
-
Linux.
bash,grep,awk,find, file permissions,cron. Often surfaces in system-design "what would you do on the box" probes.
Layer 1 — intermediate (Months 6–24).
- One warehouse. Snowflake or BigQuery or Redshift — pick one and own it cold. Clustering, partitioning, micro-partitions, cost knobs.
- dbt. Models, refs, sources, tests, macros, snapshots. The default analytics-engineering layer.
- Airflow. DAGs, operators, sensors, scheduling, retries, backfills.
- Cloud basics. IAM, S3/GCS, managed services (Glue, Dataflow, EMR). Pick AWS / GCP / Azure; the others are similar.
Layer 2 — advanced (Years 2–4).
- Spark. DataFrames, lazy evaluation, shuffle, partitioning, broadcast joins, AQE.
- Kafka. Topics, partitions, consumer groups, EOS, transactions, KRaft.
- Streaming. Watermarks, windowing, late events, stream-table joins (Streams / Flink).
- System design. Lambda vs Kappa, lakehouse, CDC, batch vs streaming, idempotence.
Layer 3 — senior (Years 4+).
- Platform thinking. Build for the next 10 teams, not just your own. Self-serve interfaces, contracts, golden paths.
- On-call ownership. Runbooks, alerting hygiene, postmortems, paging budgets.
- Mentoring. Onboard juniors in <30 days; code reviews that teach.
- Stakeholder management. Translate business asks → SQL specs; manage expectations; defend scope.
The "each layer assumes the one below" rule.
- You cannot tune Spark shuffle without understanding partitioning, which depends on knowing what a join key is, which depends on Layer-0 SQL.
- You cannot defend a Kafka topic design without understanding throughput math, which depends on Python iterators + dict semantics, which depends on Layer-0 Python.
- You cannot mentor a junior on dbt model layout without first knowing dbt cold yourself (Layer 1).
- The interview probes downward — when a candidate fluffs a Layer-2 Spark question, the interviewer drops to Layer 1 (warehouse) and Layer 0 (SQL) to find where the foundation cracks. Don't have a cracked foundation.
Worked example — diagnose where a candidate's pyramid is broken
Detailed explanation. A common interviewer drill is: given a partial answer to a Spark question, decide which layer of the pyramid is weak and re-pitch the probe. The same mental exercise works in reverse on yourself.
Question. A candidate is asked to explain Spark repartition() vs coalesce(). They mumble about "shuffles" but get the direction wrong. Diagnose which layer of their pyramid is weak and design a follow-up probe.
Code (the probe ladder).
# probe_ladder.py — interviewer's mental script
def diagnose(spark_answer_quality):
if spark_answer_quality == "wrong direction":
# they don't know what a shuffle is
return "drop to Layer 1: 'walk me through what happens when you GROUP BY in Spark'"
if spark_answer_quality == "right direction, no why":
# they know the rule but not the cost model
return "drop to Layer 0: 'show me how you'd write that GROUP BY in SQL'"
if spark_answer_quality == "full explanation":
# they're solid; probe upward
return "climb to Layer 2: 'how would you tune the partition count for a 2TB join?'"
return "ambiguous — ask for a concrete production example"
print(diagnose("wrong direction"))
Step-by-step explanation.
- The interviewer hears "shuffles" but the direction is wrong (
coalescetriggers a shuffle in their answer; it doesn't unless going from many to fewer withshuffle=True). - That signals shaky Layer 2 — Spark internals. The interviewer drops to Layer 1.
- Layer 1 probe: "What happens when you
GROUP BYin Spark?" If the candidate can't describe the wide transformation, the issue is even deeper. - Layer 0 probe: "Write me a GROUP BY in plain SQL." If that fails, the foundation is cracked and no amount of Spark trivia will save the loop.
- The lesson for the candidate — when you study Spark, don't memorise the API. Trace it down to the SQL it would compile to. The pyramid stays whole.
Output.
| Candidate answer | Diagnosis | Next probe |
|---|---|---|
| "shuffles" wrong direction | shaky Layer 2 | drop to Layer 1 (warehouse GROUP BY) |
| right direction, no cost reason | shaky Layer 1 | drop to Layer 0 (SQL GROUP BY) |
| full explanation with cost | solid through Layer 2 | climb to Layer 3 (platform-scale tuning) |
Rule of thumb. When self-studying, the moment you find yourself memorising a Spark/Kafka API without being able to trace it down to a SQL or Python primitive, stop and fix the layer below. Every interviewer probes downward.
DE-skills interview question on layer prioritisation under time pressure
A senior interviewer might ask: "You have 4 weeks to ramp a junior on your team — how do you sequence the pyramid?" — testing whether the candidate has internalised the dependency order.
Solution Using strict bottom-up sequencing with weekly milestones
# 4_week_ramp.py — the only valid teaching order
PLAN = [
("Week 1", "Layer 0: SQL — joins + window functions + EXPLAIN"),
("Week 1", "Layer 0: Python — pandas + asyncio + pytest basics"),
("Week 2", "Layer 1: one warehouse — Snowflake clustering + cost model"),
("Week 2", "Layer 1: dbt models + refs + tests"),
("Week 3", "Layer 1: Airflow DAGs + retries + sensors"),
("Week 3", "Layer 2: Spark — DataFrames + shuffle + broadcast"),
("Week 4", "Layer 2: Kafka — topics + partitions + consumer groups"),
("Week 4", "Layer 2: one system-design pipeline end-to-end (CDC → warehouse)"),
]
for week, milestone in PLAN:
print(f"{week}: {milestone}")
Step-by-step trace.
| Week | Layer | Skill | Why this order |
|---|---|---|---|
| 1 | 0 | SQL + Python | every later layer depends on these |
| 2 | 1 | Warehouse + dbt | the daily working surface |
| 3 | 1 + 2 | Airflow + Spark | scheduler first, then compute |
| 4 | 2 | Kafka + system design | last because they assume all of the above |
Output:
| Layer | Time allocated | Outcome at end of week |
|---|---|---|
| 0 (SQL + Python) | 1 week | can write a window-function query + a pytest-tested pandas script |
| 1 (warehouse + dbt + Airflow) | 1.5 weeks | can ship a dbt model to a scheduled DAG |
| 2 (Spark + Kafka + design) | 1.5 weeks | can defend a 3-stage pipeline at a whiteboard |
Why this works — concept by concept:
- Bottom-up sequencing — Layer 0 is non-negotiable in week 1; you can't dbt-model a query you can't read.
- Dual-track weeks — pair a foundational skill with a layer-1 application so the learning sticks (SQL with a Snowflake query; Python with a pandas pipeline).
- One end-to-end project — week 4 forces the layers to connect; without it the junior learns trivia, not a system.
- No Layer 3 in 4 weeks — platform thinking and on-call ownership come from running real incidents, not from a curriculum.
- Cost — 4 weeks × 40 hours = 160 hours; covers Layers 0–2 in skim form. Mastery of any one Layer 2 skill takes another 8–12 weeks.
SQL
Topic — joins
SQL joins drills (foundational)
3. Tier 1 technical — SQL, Python, dimensional modelling
Tier 1 is the gatekeeper — fail SQL on the phone screen and the loop ends
These three are the non-negotiable Tier-1 data engineering technical skills. SQL gets tested in 100% of DE phone screens. Python gets tested in 80%. Dimensional modelling gets tested in 60% of senior+ loops — and in 100% of analytics-engineer loops. If you have only 80 hours this quarter, spend 60 of them here.
The three Tier-1 skills in one frame.
- SQL — the lingua franca of every warehouse, lake, and dbt model. Joins, windows, CTEs, query plans.
- Python — the scripting glue of every DAG, every API client, every test. pandas, requests, SQLAlchemy, asyncio, pytest.
- Dimensional modelling — the design language of every warehouse layer. Fact / dim / grain / SCDs.
Common gap — juniors over-index on Python, under-index on SQL.
- Python tutorials are abundant; SQL feels "old." Result: junior DEs can write a 200-line pandas pipeline but can't write a
RANK() OVER (PARTITION BY)from memory. - Fix the imbalance first. Every job at every warehouse-first shop is SQL-first. A
LEFT JOINtyped wrong in the phone screen kills 60% of loops.
SQL — joins, window functions, CTEs, query plans
Detailed explanation. Senior-DE SQL is not "SELECT * FROM users." It is window functions for ranking and gap detection, recursive CTEs for hierarchies, anti-joins for "find what's missing," and the ability to read an EXPLAIN plan to spot a missing index or a runaway nested loop.
Code (the four SQL primitives every DE must own).
-- 1. Window function — running total per user
SELECT
user_id,
order_date,
amount,
SUM(amount) OVER (
PARTITION BY user_id ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM orders;
-- 2. Recursive CTE — manager hierarchy
WITH RECURSIVE org AS (
SELECT employee_id, manager_id, 1 AS depth
FROM employees WHERE manager_id IS NULL
UNION ALL
SELECT e.employee_id, e.manager_id, org.depth + 1
FROM employees e
JOIN org ON e.manager_id = org.employee_id
)
SELECT * FROM org;
-- 3. Anti-join — users who never placed an order
SELECT u.user_id
FROM users u
LEFT JOIN orders o ON o.user_id = u.user_id
WHERE o.user_id IS NULL;
-- 4. EXPLAIN — confirm the planner uses the right index
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 42 AND order_date >= '2026-01-01';
Step-by-step explanation.
-
Window function —
PARTITION BY user_id ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROWis the canonical running-total frame. Senior interviewers expect the explicitROWS BETWEENclause; juniors omit it. - Recursive CTE — the seed query selects the roots (top-level managers); the recursive part joins back. Every modern warehouse supports the standard syntax.
-
Anti-join —
LEFT JOIN ... WHERE IS NULLis the most-asked SQL pattern in the loop because it tests whether the candidate understands join semantics beyond INNER. The alternativeNOT EXISTSis also acceptable. - EXPLAIN — the buffered ANALYZE shows actual rows + actual time per node. The discriminator between mid and senior is whether the candidate looks at the plan first or guesses at fixes first.
Output.
| SQL skill | Coverage in interviews | Min reps to feel fluent |
|---|---|---|
| Joins (INNER/LEFT/RIGHT/FULL/CROSS) | 100% | 50+ |
| Window functions | 95% | 30+ |
| CTEs (simple + recursive) | 80% | 20+ |
| Anti-joins | 75% | 15+ |
| EXPLAIN plans | 60% | 10+ |
Common beginner mistakes.
- Writing
RANK()whenDENSE_RANK()was needed (or vice versa). Practice both. - Forgetting the explicit window frame and relying on the default (which differs across engines).
- Reaching for a subquery when a CTE would read better; or vice versa.
- Treating
EXPLAINas scary instead of a friend.
Python — pandas, requests, SQLAlchemy, asyncio, pytest
Detailed explanation. Senior-DE Python is not list comprehensions. It is pandas for the 80% of pipelines that fit in memory, requests for every API integration, SQLAlchemy for typed connections, asyncio for IO-bound concurrency, and pytest (with fixtures) for the test pyramid every reviewer expects.
Code (idiomatic Python every DE should be able to write cold).
# pipeline.py — a realistic DE script in 30 lines
import asyncio
import pandas as pd
import requests
import pytest
from sqlalchemy import create_engine
async def fetch_orders(session, day):
r = requests.get(f"https://api.example.com/orders?day={day}", timeout=10)
r.raise_for_status()
return r.json()
async def main(days):
tasks = [fetch_orders(None, d) for d in days]
raw = await asyncio.gather(*tasks)
df = pd.DataFrame([row for batch in raw for row in batch])
df["order_date"] = pd.to_datetime(df["order_date"])
df = df.groupby(["user_id", df["order_date"].dt.date]).agg(
orders=("order_id", "count"),
revenue=("amount", "sum"),
).reset_index()
engine = create_engine("postgresql+psycopg2://localhost/warehouse")
df.to_sql("daily_user_revenue", engine, if_exists="append", index=False)
# tests/test_pipeline.py
@pytest.fixture
def sample_df():
return pd.DataFrame({"user_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]})
def test_revenue_sum(sample_df):
assert sample_df.groupby("user_id")["amount"].sum().loc[1] == 30.0
Step-by-step explanation.
-
asyncio.gather— fans out N HTTP calls concurrently. For IO-bound work this is 5–20× faster than serial calls. -
pd.DataFrame([row for batch in raw for row in batch])— flattens a list of lists into a tidy DataFrame. Idiomatic. -
groupby+aggwith named tuples — readable aggregations; the senior shape. -
SQLAlchemy.create_engine— typed connection pooling; never rawpsycopg2for production scripts. -
pytestfixture — DataFrame in, assertion out. The minimum bar for a DE pull-request.
Output.
| Python primitive | Use case | Senior signal |
|---|---|---|
asyncio.gather |
IO-bound concurrency | knows the difference vs threads |
pandas.groupby.agg |
aggregations | uses named tuples, not .rename(columns=...)
|
SQLAlchemy.create_engine |
DB access | uses connection pooling, not raw cursors |
pytest.fixture |
unit tests | every PR ships at least one test |
| type hints | every function signature | uses from __future__ import annotations
|
Common beginner mistakes.
- Iterating row-by-row with
df.iterrows()instead of vectorising. Slow and un-idiomatic. - Using
requestsin a synchronous loop whenasynciowould fit; or worse, usingasynciowhenrequestsin a thread pool would be simpler. - Skipping
pytestbecause "it's a script." Every script is a candidate for review.
Dimensional modelling — fact / dim / grain / SCDs
Detailed explanation. Dimensional modelling is the Kimball language of the warehouse. Facts are the numeric measures (orders, clicks, revenue); dimensions are the descriptive attributes (user, product, date). Grain is the lowest level of detail in the fact table — "one row per order line per minute." SCDs (Slowly Changing Dimensions) capture how dimensions change over time without losing history.
Code (a star-schema DDL sketch + a SCD type-2 merge).
-- Fact + dimensions (star schema)
CREATE TABLE dim_user (
user_sk BIGINT PRIMARY KEY, -- surrogate key
user_id BIGINT NOT NULL, -- natural key
email TEXT,
country TEXT,
effective_from TIMESTAMP NOT NULL,
effective_to TIMESTAMP, -- NULL = currently active
is_current BOOLEAN NOT NULL
);
CREATE TABLE fact_orders (
order_sk BIGINT PRIMARY KEY,
user_sk BIGINT REFERENCES dim_user(user_sk),
product_sk BIGINT REFERENCES dim_product(product_sk),
order_ts TIMESTAMP NOT NULL,
amount NUMERIC(12, 2) NOT NULL,
quantity INTEGER NOT NULL
);
-- SCD type-2 merge — close the old row, open a new one
MERGE INTO dim_user AS tgt
USING staged_users AS src
ON tgt.user_id = src.user_id AND tgt.is_current = TRUE
WHEN MATCHED AND (tgt.email <> src.email OR tgt.country <> src.country) THEN
UPDATE SET effective_to = src.event_ts, is_current = FALSE;
INSERT INTO dim_user (user_sk, user_id, email, country, effective_from, effective_to, is_current)
SELECT nextval('user_sk_seq'), src.user_id, src.email, src.country,
src.event_ts, NULL, TRUE
FROM staged_users src
LEFT JOIN dim_user d
ON d.user_id = src.user_id AND d.is_current = TRUE
WHERE d.user_id IS NULL OR d.email <> src.email OR d.country <> src.country;
Step-by-step explanation.
-
Surrogate key (
user_sk) separates warehouse identity from source identity (user_id); lets SCD type-2 store many versions per natural key. -
fact_orders.user_skpoints at the version of the user that was current atorder_ts— preserves the "as it was then" view forever. -
SCD type-2 merge closes the prior row (
effective_to = src.event_ts,is_current = FALSE) and inserts a fresh row with the new attributes. -
is_currentboolean is a denormalised convenience for downstream queries; it must be flipped atomically with the new insert. - Type-1, type-3, type-6 are the other common SCDs — type 1 overwrites; type 3 keeps prior value in a side column; type 6 combines 1+2+3.
Output.
| Dimension table state at t=10 | user_sk | user_id | effective_from | effective_to | is_current | |
|---|---|---|---|---|---|---|
| original | 1001 | 42 | a@example.com | 2026-01-01 | NULL | true |
| After email change at t=20 | user_sk | user_id | effective_from | effective_to | is_current | |
|---|---|---|---|---|---|---|
| closed | 1001 | 42 | a@example.com | 2026-01-01 | 2026-02-15 | false |
| new | 1002 | 42 | new@example.com | 2026-02-15 | NULL | true |
Common beginner mistakes.
- Using the natural key as the foreign key in the fact table — destroys SCD history.
- Forgetting
is_current = TRUEin the mergeON— produces duplicate active rows. - Modelling at "one row per customer" when the grain is actually "one row per order" — flattens the truth.
DE-skills interview question on a window-function ranking
A common phone-screen probe: "Top 3 products by revenue per category in the last 30 days — write the SQL." Tests joins + filtering + window functions + LIMIT/QUALIFY semantics.
Solution Using ROW_NUMBER() OVER (PARTITION BY category) + QUALIFY
WITH last_30 AS (
SELECT
p.category,
p.product_id,
SUM(o.amount) AS revenue
FROM orders o
JOIN products p ON p.product_id = o.product_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY p.category, p.product_id
)
SELECT
category,
product_id,
revenue,
ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk
FROM last_30
QUALIFY rnk <= 3
ORDER BY category, rnk;
Step-by-step trace.
| Stage | Row count | Operation | Result |
|---|---|---|---|
| 1 | 1,000,000 orders | filter order_date >= now() - 30d
|
80,000 orders |
| 2 | 80,000 orders | join products on product_id
|
80,000 enriched rows |
| 3 | 80,000 enriched |
GROUP BY category, product_id + SUM(amount)
|
2,500 (category, product) rows |
| 4 | 2,500 rows | ROW_NUMBER() PARTITION BY category ORDER BY revenue DESC |
2,500 rows with rnk
|
| 5 | 2,500 rows | QUALIFY rnk <= 3 |
3 × number of categories rows |
Output:
| category | product_id | revenue | rnk |
|---|---|---|---|
| books | 71 | 9,400 | 1 |
| books | 88 | 7,150 | 2 |
| books | 23 | 6,800 | 3 |
| electronics | 105 | 24,300 | 1 |
| electronics | 142 | 21,000 | 2 |
| electronics | 198 | 18,400 | 3 |
Why this works — concept by concept:
-
CTE for the aggregate — the
last_30CTE keeps the window function operating on aggregated revenue, not raw orders. Reads top-to-bottom. -
ROW_NUMBER vs RANK vs DENSE_RANK —
ROW_NUMBERis the right pick for "top 3 with no ties";RANKwould skip ranks after a tie;DENSE_RANKwouldn't skip. - PARTITION BY category — restarts the numbering per category, giving "top 3 per category" instead of "top 3 overall."
-
QUALIFY — Snowflake/BigQuery filter on a window function in one pass; in Postgres you'd wrap in another CTE and filter on
rnk. -
Cost — O(N log N) for the sort inside the window; O(N) for the aggregate; small constant on top from
QUALIFY. Single pass on the aggregated set.
SQL
Topic — window functions
Top-N per group with window functions
Python
Topic — data structures
Python data-structure drills
SQL
Topic — dimensional modelling
Dimensional modelling problems
4. Modern-stack technical — Spark, Kafka, Airflow, cloud, warehouses, dbt, system design
Pick a stack and go deep — "stack-shopping" wastes interview prep time
The modern-stack layer is where skills for data engineer roles diverge from analytics-engineer roles. Spark and Kafka are the distributed-compute backbone; Airflow (or Dagster / Prefect) is the scheduler; one cloud + one warehouse + dbt round out the typical 2026 stack. System design is the seam that tests whether you can hold all of them in your head at once.
The 2026 stack-shopping trap.
- Juniors with three weeks of every tool look weak on every tool. Recruiters can smell it on the resume.
- Pick one tool per layer (e.g. AWS + Snowflake + Airflow + dbt + Spark + Kafka) and run a real project end-to-end. Real means a deployed pipeline a real consumer reads.
- If you don't know what stack to pick: AWS + Snowflake + Airflow + dbt + Spark + Kafka is the safest 2026 choice for FAANG-adjacent loops.
Apache Spark — DataFrames, lazy evaluation, shuffle, partitioning
Detailed explanation. Spark is the distributed-compute spine of the lakehouse. The model: DataFrames are lazy, every transformation builds a logical plan, actions trigger a physical plan, and shuffles (wide transformations like groupBy, join, repartition) are the dominant cost. The senior probe is always partitioning + shuffle + broadcast joins.
Code (the canonical Spark idiom: avoid the shuffle).
# orders_revenue.py — PySpark
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.appName("daily_revenue").getOrCreate()
orders = spark.read.parquet("s3://datalake/orders/") # 2 TB, partitioned by order_date
products = spark.read.parquet("s3://datalake/products/") # 50 MB, small enough to broadcast
# 1. Predicate pushdown — only read the last day's partition
orders_today = orders.where(F.col("order_date") == "2026-05-30")
# 2. Broadcast join — small side replicated to every executor; no shuffle of the big side
enriched = orders_today.join(
F.broadcast(products), on="product_id", how="inner"
)
# 3. groupBy is the unavoidable wide transformation
daily = (
enriched
.groupBy("category")
.agg(F.sum("amount").alias("revenue"), F.count("*").alias("orders"))
)
daily.write.mode("overwrite").parquet("s3://warehouse/daily_revenue/")
Step-by-step explanation.
-
Predicate pushdown —
.where(F.col("order_date") == "2026-05-30")filters at the Parquet level, scanning only one partition's files instead of the whole 2TB. -
Broadcast join —
F.broadcast(products)ships the 50MB dim to every executor; the 2TB fact is read once and joined locally. No shuffle of the big side. Saves hours. -
groupBy+agg— the genuinely wide transformation. Spark must shuffle records bycategoryso each executor sees all rows for one group. AQE (Adaptive Query Execution, default in Spark 3.x+) tunes the post-shuffle partition count automatically. -
Write to Parquet — columnar, splittable, compressible. The downstream warehouse can
COPYit straight in.
Output.
| Stage | Input size | Shuffled? | Output size |
|---|---|---|---|
| Read + predicate pushdown | 2 TB on disk | no | 50 GB scanned |
| Broadcast join | 50 GB + 50 MB | no shuffle of big side | 50 GB enriched |
| GroupBy aggregate | 50 GB | yes (shuffled by category) | 200 MB |
| Write parquet | 200 MB | no | 200 MB |
Common beginner mistakes.
- Reading the whole table without predicate pushdown.
- Forgetting
F.broadcast()on the small side of a join. - Using
repartition(N)whencoalesce(N)would do (no shuffle needed) — or vice versa.
Apache Kafka — topics, partitions, consumer groups, EOS
Detailed explanation. Kafka is the durable log between every producer and every consumer in a modern streaming pipeline. The four things every interview probes: partitions (unit of parallelism and ordering), consumer groups (partition fan-out), acks=all + min.insync.replicas=2 (durability), and exactly-once semantics (idempotent producer + transactions + read_committed).
Code (a durable producer + an at-least-once consumer).
# producer.py
from confluent_kafka import Producer
producer = Producer({
"bootstrap.servers": "kafka-1:9092",
"acks": "all",
"enable.idempotence": True,
"compression.type": "zstd",
"linger.ms": 10,
"batch.size": 65536,
})
producer.produce("orders", key=b"42", value=b'{"order_id":1001}')
producer.flush()
# consumer.py
from confluent_kafka import Consumer
consumer = Consumer({
"bootstrap.servers": "kafka-1:9092",
"group.id": "orders-to-warehouse",
"enable.auto.commit": False,
"isolation.level": "read_committed",
"partition.assignment.strategy": "cooperative-sticky",
})
consumer.subscribe(["orders"])
while True:
msg = consumer.poll(1.0)
if msg and not msg.error():
write_to_warehouse(msg.value()) # side effect first
consumer.commit(asynchronous=False) # then commit
Step-by-step explanation.
-
acks=all+enable.idempotence=true— the strongest durability primitive. The producer waits for every ISR member to ack; PID + sequence number deduplicates retries. -
compression.type=zstd— ~3× wire and disk savings. -
cooperative-sticky— the default in Kafka 4.x; only the affected partitions revoke on a consumer add/remove. - Manual commit after the side effect — at-least-once semantics; the warehouse needs an idempotent UPSERT to dedup re-deliveries.
-
isolation.level=read_committed— only see committed transactional records; skip in-flight ones.
Output.
| Concept | Default in Kafka 4.x | Why it matters |
|---|---|---|
| Partition assignment | cooperative-sticky | no stop-the-world rebalances |
| Producer idempotence | true | retries don't double-write |
| Consumer offset commit | manual after side effect | at-least-once is the safe default |
| EOS | idempotent + transactions + read_committed | atomic consume-transform-produce |
Common beginner mistakes.
- Leaving
enable.auto.commit=trueand then wondering about duplicate writes. - Reading
acks=allas "slow" when in practice it's still a few-ms RTT. - Forgetting
transactional.idwhen running multiple worker instances — no zombie fencing means duplicates after restart.
Apache Airflow — DAGs, operators, sensors, scheduling
Detailed explanation. Airflow (or Dagster / Prefect) is the scheduler that orchestrates DAGs of tasks. The senior probes: idempotent task design, backfills, sensors vs ExternalTaskSensor vs deferrable operators, executor choices (LocalExecutor / CeleryExecutor / KubernetesExecutor), and secret management.
Code (a realistic Airflow 2.x DAG with deferrable sensors).
# dags/daily_revenue.py
from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from datetime import datetime, timedelta
with DAG(
"daily_revenue",
start_date=datetime(2026, 1, 1),
schedule="0 2 * * *",
catchup=True,
max_active_runs=4,
default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
wait = S3KeySensor(
task_id="wait_for_orders",
bucket_key="orders/{{ ds }}/_SUCCESS",
bucket_name="datalake",
deferrable=True, # frees the worker slot while waiting
timeout=60 * 60,
)
load = SnowflakeOperator(
task_id="load_daily_revenue",
sql="""
INSERT INTO warehouse.daily_revenue
SELECT category, SUM(amount) AS revenue, COUNT(*) AS orders
FROM staging.orders
WHERE order_date = '{{ ds }}'
GROUP BY category;
""",
snowflake_conn_id="snowflake_default",
)
wait >> load
Step-by-step explanation.
-
start_date+catchup=True— backfills automatically fromstart_dateif the DAG falls behind. Idempotent task design makes this safe. -
{{ ds }}— the execution date macro; the task uses it to scope the SQL to one day. Re-runs for the same day produce the same output (idempotent). -
Deferrable sensor — the S3 sensor releases its worker slot while waiting; the Airflow triggerer handles the polling. Avoids the "sensor starvation" of the old
pokemode. -
max_active_runs=4— bounds concurrent DAG runs during a backfill; protects the warehouse from a thundering herd. -
retries=3, retry_delay=5min— handles transient warehouse/S3 errors automatically.
Output.
| Knob | Choice | Why |
|---|---|---|
schedule |
0 2 * * * (daily at 02:00) |
downstream warehouse cron lines up |
catchup |
True | safe because tasks are idempotent |
deferrable |
True on sensors | frees worker slots during waits |
max_active_runs |
4 | backfill without overloading the warehouse |
| Executor | KubernetesExecutor | isolated pod per task; right default for 2026 |
Common beginner mistakes.
- Non-idempotent tasks (
INSERTinstead ofMERGE/INSERT OVERWRITE). - Sensor in
pokemode hogging a worker slot for hours. - Hard-coded credentials in
paramsinstead ofConnection+ Secrets Backend.
Cloud — AWS, GCP, or Azure (pick one)
Detailed explanation. Every senior DE owns one cloud cold and reads the other two. The 2026 baseline: IAM (least privilege), object storage (S3 / GCS / ADLS), one compute primitive (Lambda / Cloud Functions / Azure Functions), one managed ETL (Glue / Dataflow / Data Factory), and one VPC / networking story.
Code (a typical S3 + IAM least-privilege pattern in Terraform).
# main.tf — least-privilege IAM for a data pipeline
resource "aws_iam_role" "etl" {
name = "etl-pipeline"
assume_role_policy = jsonencode({
Version = "2012-10-17",
Statement = [{
Effect = "Allow",
Principal = { Service = "lambda.amazonaws.com" },
Action = "sts:AssumeRole"
}]
})
}
resource "aws_iam_policy" "etl_rw_one_bucket" {
name = "etl-rw-one-bucket"
policy = jsonencode({
Version = "2012-10-17",
Statement = [{
Effect = "Allow",
Action = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
Resource = [
"arn:aws:s3:::datalake-prod",
"arn:aws:s3:::datalake-prod/etl/*"
]
}]
})
}
resource "aws_iam_role_policy_attachment" "attach" {
role = aws_iam_role.etl.name
policy_arn = aws_iam_policy.etl_rw_one_bucket.arn
}
Step-by-step explanation.
-
assume_role_policy— only Lambda can assume the role. No wildcard principals. -
Inline
Actionlist — the role canGetObject,PutObject, andListBucket. Nots3:*. Not*. -
Resourcescoped to one bucket + one prefix — blast radius bounded to the etl prefix of the prod datalake. - No data-plane policy on the prod warehouse role — the warehouse role is a separate identity with no S3 write access.
- Every cloud has an equivalent. GCP service account + IAM binding; Azure managed identity + RBAC role assignment.
Output.
| Service | AWS | GCP | Azure |
|---|---|---|---|
| Object store | S3 | GCS | ADLS Gen2 |
| Compute (FaaS) | Lambda | Cloud Functions | Azure Functions |
| Managed ETL | Glue | Dataflow | Data Factory |
| Identity | IAM | IAM + service accounts | RBAC + Managed Identity |
| Secret store | Secrets Manager | Secret Manager | Key Vault |
Common beginner mistakes.
-
s3:*on*"just to get it working." Permanent prod state. - Using static IAM access keys instead of IRSA / Workload Identity / Managed Identity.
- One IAM role for the whole pipeline instead of per-task least privilege.
Warehouses — Snowflake, BigQuery, Redshift (pick one)
Detailed explanation. The warehouse is the daily working surface for 70% of senior DE work. Snowflake / BigQuery / Redshift differ in clustering vs partitioning semantics, cost model (credits vs slots vs nodes), and SQL flavor — but every senior probe is the same: scan reduction, clustering keys, cost containment.
Code (a Snowflake performance pattern — clustering + result cache).
-- Cluster the fact by the most-used predicate
ALTER TABLE warehouse.fact_orders CLUSTER BY (order_date, user_id);
-- Use the result cache: identical queries within 24h are free
SELECT category, SUM(amount) AS revenue
FROM warehouse.fact_orders
WHERE order_date BETWEEN '2026-05-01' AND '2026-05-30'
GROUP BY category;
-- Warehouse-level cost knob — auto-suspend keeps idle cost near zero
ALTER WAREHOUSE etl_wh SET
AUTO_SUSPEND = 60, -- suspend after 60s idle
AUTO_RESUME = TRUE,
MIN_CLUSTER_COUNT = 1,
MAX_CLUSTER_COUNT = 4; -- scale-out under contention
Step-by-step explanation.
-
CLUSTER BY (order_date, user_id)— Snowflake reorganises micro-partitions on this key, so range scans onorder_date(the most common predicate) skip 90%+ of micro-partitions. - Result cache — the same query within 24h with no underlying-data change is served from cache; you don't pay credits for it.
-
AUTO_SUSPEND=60— the warehouse stops billing 60s after idle. Critical for cost. - Multi-cluster warehouse — under contention, additional clusters spin up; under no contention, you pay for one.
Output.
| Operation | Without clustering | With clustering on (order_date, user_id) |
|---|---|---|
WHERE order_date = ... |
scans full table | scans ~1% of micro-partitions |
| Cost per query | $$$ | $ |
| Result cache hit | reads micro-partitions | free |
Common beginner mistakes.
- Clustering on a low-cardinality column ("status").
- Querying without a predicate on the clustering key.
- Leaving
AUTO_SUSPENDat the default 10 minutes for a sporadically-used warehouse.
dbt — models, refs, sources, tests, macros
Detailed explanation. dbt is the analytics-engineering layer that turns SQL into a tested, documented DAG. Every senior probe: ref() and source() discipline, tests (not_null, unique, relationships, custom singular tests), incremental models, macros, and dbt run --select state:modified+ in CI.
Code (a realistic incremental model with tests).
-- models/marts/fact_orders.sql
{{
config(
materialized = "incremental",
unique_key = "order_id",
incremental_strategy = "merge",
on_schema_change = "append_new_columns"
)
}}
WITH src AS (
SELECT *
FROM {{ source('raw', 'orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}
)
SELECT
order_id,
user_id,
product_id,
amount,
order_date,
updated_at
FROM src;
# models/marts/schema.yml
version: 2
models:
- name: fact_orders
description: "Order grain — one row per order, latest version."
columns:
- name: order_id
tests: [not_null, unique]
- name: user_id
tests:
- relationships:
to: ref('dim_user')
field: user_id
Step-by-step explanation.
-
materialized = 'incremental'— dbt builds the table once and merges deltas on subsequent runs; never re-processes the full history. -
unique_key = 'order_id'+incremental_strategy = 'merge'— the engine deduplicates on the key; updates flow through. -
is_incremental()macro — restricts the source scan to rows newer than the existing maxupdated_atin the target. -
source()— the contract with the raw layer; renaming the raw table breaks the dbt build immediately. -
Tests — every PR runs
not_null,unique, andrelationshipson every model. Failures block the merge.
Output.
| dbt primitive | Effect |
|---|---|
ref() |
enforces DAG ordering + lineage |
source() |
declares raw-layer dependencies |
tests |
block merges on data-quality violations |
incremental |
avoids reprocessing history |
state:modified+ |
CI only builds what changed + its descendants |
Common beginner mistakes.
- Hand-rolling
INSERTSQL instead of usingincrementalmaterialisation. - Skipping the
unique+not_nulltests on the primary key. - Forgetting
state:modified+in CI — slow PR cycles.
System design — the seam that tests everything at once
Detailed explanation. System design is the bridge skill — it requires every Tier-1 and Tier-2 skill at once. The frame: read the prompt → ask 3 clarifying questions → sketch the pipeline → discuss trade-offs → handle failure modes → estimate cost. The probe rewards trade-off reasoning over "right answers."
Code (a system-design checklist you can run mentally in any loop).
SYSTEM DESIGN CHECKLIST — 45-minute round
1. Clarify (3 minutes)
- What's the volume? (events/sec, GB/day, rows/year)
- What's the SLA? (freshness, downtime budget, query latency)
- Who's the consumer? (dashboard, ML, ops, regulator)
2. Sketch the pipeline (10 minutes)
- Source → ingest → storage → transform → serve
- Mark batch vs streaming on every edge
3. Trade-offs (15 minutes)
- Lambda vs Kappa (two systems vs one)
- Idempotence pattern (UPSERT, MERGE, dedupe key)
- Schema evolution (BACKWARD compat, contracts)
- Storage format (Parquet, Iceberg, Delta)
4. Failure modes (10 minutes)
- Source dies for 4 hours — what happens?
- Schema drift — who breaks first?
- 10× spike — where does it fall over?
- Bad data backfilled — how do you recover?
5. Cost estimate (7 minutes)
- Storage: GB/day × $/GB × retention
- Compute: hours × $/hour
- Network egress: GB/day × $/GB
- Warehouse credits: queries/day × avg credit/query
Step-by-step explanation.
- Always clarify first. A confident "let me make sure I understand the prompt — 3 things..." is the single highest-signal opening move.
- Sketch on the board immediately. Boxes + arrows + a one-word label per box. Don't talk for 10 minutes without drawing.
- Trade-off > right answer. "I'd reach for Kappa because we already have Kafka and one fewer system to operate; if the analytical patterns required heavy historical re-processing, I'd switch to Lambda."
- Failure modes test maturity. Anyone can draw a happy path; senior DEs draw the 4am page.
- Cost closes the round. "At 100k events/sec × 1KB × 30-day retention × 3× replication = 78TB of Kafka storage" — a back-of-envelope number wins points.
Output.
| Round phase | Senior signal | Junior anti-signal |
|---|---|---|
| Clarify | 3 sharp questions | dives into design |
| Sketch | starts drawing in minute 4 | talks for 15 minutes |
| Trade-offs | names 2 alternatives + the chosen one | "this is the best way" |
| Failure modes | draws the 4am page | "shouldn't happen" |
| Cost | back-of-envelope number | "I don't know" |
Common beginner mistakes.
- Jumping to "I'd use Kafka and Spark" without clarifying volume.
- Not drawing on the board.
- Never naming a trade-off.
DE-skills interview question on choosing batch vs streaming
A frequent on-site probe: "Should this pipeline be batch or streaming?" — testing whether the candidate can map SLA → architecture.
Solution Using the freshness × volume × cost matrix
def batch_or_stream(freshness_sla_minutes, volume_events_per_sec, ops_complexity_budget):
"""Return 'batch', 'streaming', or 'mini-batch' based on the requirement matrix."""
if freshness_sla_minutes <= 1:
return "streaming" # sub-minute SLA → no choice
if freshness_sla_minutes >= 60 and volume_events_per_sec < 1000:
return "batch" # hourly+ SLA, low volume → boring is good
if freshness_sla_minutes <= 15 and ops_complexity_budget == "low":
return "mini-batch" # 5–15 min, but team can't run streaming
return "streaming"
print(batch_or_stream(freshness_sla_minutes=5, volume_events_per_sec=50_000, ops_complexity_budget="high"))
Step-by-step trace.
| Scenario | freshness | volume | ops budget | decision |
|---|---|---|---|---|
| Real-time fraud | 1 sec | 5,000/s | high | streaming |
| Daily revenue dashboard | 1 hour | 200/s | low | batch |
| Marketing attribution | 10 min | 800/s | low | mini-batch |
| Ad bidding | 100 ms | 50,000/s | high | streaming |
Output:
| Pipeline shape | When to pick | Cost characteristic |
|---|---|---|
| Batch (Airflow + warehouse SQL) | hourly+ SLA, low complexity tolerance | cheapest, simplest |
| Mini-batch (5–15 min runs) | sub-hour SLA, low ops budget | middle ground |
| Streaming (Kafka + Flink/Spark Structured Streaming) | sub-minute SLA or high event rates | most expensive to run |
Why this works — concept by concept:
- SLA leads — freshness requirement is the only forcing function that takes batch off the table. Everything else is preference.
- Volume matters second — at very high event rates batch becomes operationally awkward (huge batches that miss SLAs).
- Ops complexity is the hidden axis — streaming requires on-call, watermarks, late-event handling, schema-evolution discipline. Teams without that maturity should stay in batch as long as the SLA allows.
-
Mini-batch is the polite compromise — Spark Structured Streaming with
trigger=AvailableNowgives sub-15-minute latency without a true streaming on-call burden. - Cost — batch ≈ 1×; mini-batch ≈ 2×; streaming ≈ 4–6× in cluster + on-call cost.
Streaming
Topic — streaming
Streaming pipeline problems (Kafka, Spark, EOS)
Design
Topic — design
DE system-design drills
Data
Topic — ETL
End-to-end ETL drills
5. Soft skills — stakeholders, on-call, communication, mentoring, prioritisation
The 5 soft skills are tested in behaviourals and gated for senior+ — not optional "people skills"
These five are the data engineer soft skills every modern DE loop probes in the behavioural round and every performance-review cycle measures after the hire. They are not "nice extras." A staff-engineer promotion packet that's strong on the technical 10 but weak on these 5 stalls — every time.
The behavioural-round mapping.
- "Tell me about a time you partnered with a non-technical stakeholder" → stakeholder management.
- "Walk me through your worst on-call incident" → on-call discipline.
- "Tell me about a design doc you wrote that was controversial" → communication.
- "Tell me about a junior you helped grow" → mentoring.
- "Tell me about a time you said no" → prioritisation.
STAR is the answer shape — every story, every time.
- Situation — set the stage in one sentence.
- Task — what you specifically owned.
- Action — what you did (always "I", not "we").
- Result — measurable outcome with a number.
Stakeholder management — translate business asks into SQL specs
Detailed explanation. Stakeholder management is the muscle of turning a vague marketing or product question ("why is signup conversion down?") into a clear data spec ("week-over-week conversion by traffic source for paid signups, excluding internal-test users, comparing last 8 weeks") and back to a business answer the stakeholder can act on. The senior signal is expectation management — saying "the v0 lands in 3 days; v1 with the cohort dimension in 2 weeks" instead of nodding silently and missing.
STAR template (stakeholder management).
SITUATION (1 sentence)
The marketing team came to me with a vague question: "why is signup conversion down?"
TASK (1 sentence)
I owned turning that into a data spec, building the metric, and reporting back in <1 week.
ACTION (3-5 short sentences)
- I scheduled a 30-min scoping call and wrote a shared one-pager with 4 candidate definitions.
- We picked "paid signups within 24h of first-touch, week-over-week, by source" together.
- I shipped a v0 SQL view in 3 days that any analyst could query.
- I set up a weekly review meeting and a #marketing-data Slack channel for follow-ups.
RESULT (1 sentence with a number)
The marketing team caught a 9% drop attributable to one paid channel within 2 weeks
and saved an estimated $40k of mis-attributed spend that quarter.
Step-by-step explanation.
- Scope-call + one-pager — turns vague ask into precise spec; killer for unwritten expectations.
- Pick the definition together — the stakeholder owns the metric; you own the SQL. Joint ownership prevents future "that's not what I meant."
- Ship v0 fast — momentum beats perfection; v0 surfaces the next questions.
- Set up the channel — async > meetings; the Slack channel scales with team size.
- Result with a number — "saved $40k" is the senior shape; "they were happy" is the junior shape.
Output.
| STAR beat | Senior shape | Junior shape |
|---|---|---|
| Situation | 1 sentence | 5 minutes of context |
| Task | "I owned X" | "we kind of did X" |
| Action | 4 sharp first-person actions | 8 vague we-statements |
| Result | quantified business impact | "they were happy" |
On-call discipline — runbooks, alerting hygiene, postmortems
Detailed explanation. On-call discipline is the muscle of running a data platform 24/7 without burnout. The four pillars: runbooks (every page has a documented response in <2 minutes), alerting hygiene (zero false positives; pages are real), postmortems (every incident gets a blameless write-up within 5 business days), paging budget (every team owns its noise; a noisy team is asked to fix it).
STAR template (on-call discipline).
SITUATION
We had a Kafka consumer-lag page that fired at 3am every other week.
TASK
I owned the on-call rotation for the streaming platform; this page was killing morale.
ACTION
- I pulled the last 6 weeks of incidents — 80% of the pages were resolved by "just wait."
- I wrote a runbook for the remaining 20% (real consumer-group rebalance issues).
- I raised the alert threshold from 1k lag to 50k lag and added a 10-min sustain window.
- I added a recovery-action playbook to the on-call dashboard.
- I scheduled a postmortem template-walkthrough for the team.
RESULT
Pages dropped 70% within one month; MTTR on real incidents fell from 25 min to 8 min
because the runbook was already open.
Step-by-step explanation.
- Pull the data first — "80% resolved by just wait" is the killer line. Data > opinion in incident reduction.
- Runbook for the real ones — the 20% deserve documentation; the 80% deserve to not page at all.
- Threshold + sustain — raises specificity. Most on-call noise is noisy thresholds.
- Recovery dashboard — the on-call engineer should reach for the dashboard before reaching for code.
- Postmortem cadence — blameless, time-boxed, and templated. Every incident becomes a teaching artifact.
Output.
| On-call hygiene primitive | Effect |
|---|---|
| Runbook per alert | MTTR drops |
| Threshold + sustain window | false positives drop |
| Blameless postmortem | repeat incidents drop |
| Paging budget per team | team owns its noise |
Communication — RFCs, design docs, async-first
Detailed explanation. Senior-DE communication is async-first written communication — RFCs, design docs, weekly status updates, postmortems. The signal: a 1-page design doc that a sceptical reviewer can read in 5 minutes and either approve or comment on with crisp blockers. Verbal communication matters too (1:1s, standups) but written is the leverage point.
STAR template (communication).
SITUATION
Our data platform was considering switching from Airflow to Dagster.
TASK
I owned writing the RFC that would gate the decision.
ACTION
- I wrote a 2-page RFC: problem statement, 3 options, recommendation, cost, rollback.
- I held two 30-min open-comment sessions for the 12 engineers on the platform team.
- I addressed every comment in a public reply thread with my reasoning.
- I scheduled a decision meeting with the directors and walked the doc in 10 minutes.
- I shipped a 1-page summary post-decision so the org could ramp on the choice.
RESULT
Decision made in 2 weeks (vs. the typical 8-week stall pattern); zero post-decision
litigation because every concern was on the public record.
Step-by-step explanation.
- RFC is the artifact — not a meeting; not a Slack thread; not a deck. The RFC is the source of truth.
- Open-comment session — async first, sync to unblock. Anyone who didn't comment forfeits the right to litigate later.
- Public reply thread — every concern is acknowledged in writing. Silence on a concern is the worst signal.
- Decision meeting walks the doc, not invents new content — saves 90% of decision-meeting time.
- Summary post — the doc is rich; the summary is what 50 other engineers will actually read.
Output.
| Communication artifact | Reader effort | When to reach for it |
|---|---|---|
| 1-page summary | 1 min | broad org broadcast |
| 2-page RFC | 5 min | every meaningful technical decision |
| Design doc (5–10 pages) | 15 min | quarter-scale projects |
| Postmortem (template) | 10 min | every incident |
| Weekly status | 2 min | every Friday, every project |
Mentoring — onboard juniors, code reviews that teach
Detailed explanation. Mentoring is the muscle of growing other engineers without burning out yourself. The pillars: 30-day onboarding plan (with weekly milestones), code reviews that teach (the comment explains why, not just what), paired debugging (one screen, two engineers, on the real bug), career conversations (quarterly, not "next promo cycle").
STAR template (mentoring).
SITUATION
A new-grad joined my team with strong CS fundamentals but no production DE experience.
TASK
I owned their 30-day onboarding plan and weekly 1:1s.
ACTION
- I wrote a 1-page 30-day plan: Week 1 SQL deep dive, Week 2 dbt model, Week 3 Airflow DAG, Week 4 end-to-end ticket.
- I paired with them for 1 hour on day 1, day 3, day 5, then 1:1s weekly.
- Every code review comment of mine included a "here's the why" link to a doc or a paper.
- I made sure they shipped a real production change in week 2 (small, but real).
- I introduced them to 5 senior engineers on adjacent teams in their first 2 weeks.
RESULT
They were taking on independent tickets by week 5 (target was week 6) and ranked
"strongly exceeds" on their 6-month review.
Step-by-step explanation.
- 30-day plan — removes ambiguity for the junior and creates accountability for the mentor.
- Paired debugging — the highest-bandwidth knowledge transfer. One hour of pairing = 5 hours of solo struggle.
- Code-review "why" — every comment links to a doc, a paper, or a prior incident. Teaches reasoning, not rules.
- Real production change in week 2 — momentum matters more than safety; small + real beats big + simulated.
- Adjacent-team intros — builds the junior's network; reduces their "everything goes through me" burden.
Output.
| Mentoring primitive | Junior-side effect | Mentor-side cost |
|---|---|---|
| 30-day plan | clear weekly milestones | 2 hours to write |
| Paired debugging | rapid skill transfer | 1 hour per session |
| Why-linked code review | teaches reasoning | 5 min per PR vs 2 min |
| Real prod change in week 2 | confidence + ownership | one extra reviewer cycle |
Prioritisation — saying no, defending scope, OKR alignment
Detailed explanation. Prioritisation is the muscle of saying no to a stakeholder without burning the relationship, and defending scope against a sceptical PM without becoming a roadblock. Senior DEs hold the line because they tie every request to OKRs and surface trade-offs explicitly.
STAR template (prioritisation).
SITUATION
Three product teams asked for "urgent" dashboard work in the same week.
TASK
I owned weekly prioritisation for our 4-engineer platform team.
ACTION
- I wrote a 1-pager mapping each ask to the company's two Q-OKRs (revenue + retention).
- Ask A mapped to revenue (P0); Ask B mapped to retention (P1); Ask C mapped to neither (P3).
- I took the 1-pager to the requesting PMs in a 20-min sync and said: "we can ship A this week, B next sprint, and C in Q+1 unless you can re-map it to an OKR."
- Ask C's PM brought a re-map argument; we accepted the re-map and bumped C to P2.
- I sent a recap email with the final priorities and timelines.
RESULT
Zero re-litigation of priorities for the quarter; team velocity rose 30% because
context-switching went down.
Step-by-step explanation.
- Map every ask to an OKR — the OKR is the shared currency; arguments framed in OKR-language carry weight.
- Make the "no" specific — not "no", but "we can ship A; B slips; C goes to Q+1 unless..."
- Give the requester a path — "unless you can re-map" is generous; it lets the requester salvage their ask.
- Recap in writing — the recap is the contract; reduces silent expectation drift.
- Velocity rises because context-switching falls — the senior signal is "I made the team faster by saying no to noise."
Output.
| Prioritisation primitive | Effect on team |
|---|---|
| OKR mapping | shared currency for arguments |
| Specific no + path | preserves the relationship |
| Written recap | reduces silent drift |
| Said-no number | counter-intuitive promotion signal |
DE-skills interview question on a controversial trade-off
A senior behavioural probe: "Tell me about a time you had to defend a technical decision in front of a sceptical audience." — testing communication + prioritisation + influence at once.
Solution Using the STAR + trade-off-and-counter shape
SITUATION (15s)
Our streaming pipeline was missing the 99.9% freshness SLA twice a month.
The product team wanted us to migrate from Spark Structured Streaming to Flink.
TASK (10s)
I owned the recommendation; the platform director wanted my call in 2 weeks.
ACTION (90s)
- I instrumented the pipeline and found 80% of misses came from one upstream connector,
not the stream engine itself.
- I wrote an RFC with three options: (a) migrate to Flink, (b) fix the connector,
(c) raise the SLA from 99.9% to 99.5% and revisit in 6 months.
- I held an open-comment session; the product team initially pushed for (a).
- I walked the numbers: (a) cost = 12 engineer-weeks + new on-call burden, (b) cost
= 2 weeks + zero new operational surface, (c) cost = zero engineering but stakeholder
trust hit.
- I recommended (b) and committed to a 2-week timeline + a rollback to (a) if misses
didn't drop below 1/month.
- Product team agreed because the trade-off was on paper, not in our heads.
RESULT (15s)
We shipped the connector fix in 9 days; misses dropped to 0 in the following 8 weeks;
saved an estimated 10 engineer-weeks of unnecessary platform migration.
Step-by-step trace.
| STAR beat | Time | Senior cue |
|---|---|---|
| Situation | 15s | one sentence, no padding |
| Task | 10s | "I owned" — first-person ownership |
| Action | 90s | 4 sharp action verbs + named trade-off |
| Result | 15s | quantified savings + a number |
Output:
| Story dimension | Senior shape | Junior shape |
|---|---|---|
| Length | 2 minutes total | 6 minutes total |
| Voice | "I" | "we" |
| Data | numbers in the action | numbers absent |
| Outcome | engineer-weeks saved | "they were happy" |
| Decision artifact | RFC + open comment | meeting + Slack |
Why this works — concept by concept:
- Data-first action — instrumenting before recommending is the senior move. It earns the right to push back on the product team's preferred answer.
- RFC with 3 options — never bring a recommendation without alternatives; the alternatives are how trust is earned.
- Quantified trade-off — "12 engineer-weeks + new on-call burden" beats "this would be a lot of work." Numbers unlock decisions.
- Rollback commitment — "rollback to (a) if misses don't drop below 1/month" is the reversibility clause that lets the team commit.
- Result with a number — "saved 10 engineer-weeks" is the bar-raiser-ready close.
- Cost — preparing one STAR story takes 30–45 minutes; the muscle of thinking in STAR comes from doing 10–15 of them.
Behavioural
Course — behavioural prep
Behavioural interview prep for DEs
Product
Course — product sense + metrics
Product sense and metrics for DE interviews
Cheat sheet — DE skills self-assessment grid
The grid below is a 15-row self-rating sheet — 10 technical + 5 soft. Rate yourself on each row against the three columns (Junior / Mid / Senior expectation), then circle your three weakest skills as study priorities for the next quarter.
| Skill | Junior expectation | Mid expectation | Senior expectation |
|---|---|---|---|
| SQL | joins, GROUP BY | window fns, CTEs, EXPLAIN | recursive CTEs, plan tuning, dialect quirks |
| Python | functions, dict, list | pandas, requests, pytest | asyncio, SQLAlchemy, performance profiling |
| Dimensional modelling | fact + dim concept | grain, SCD types 1/2 | conformed dims, bus matrix, SCD type 6 |
| Apache Spark | DataFrames basics | shuffle, broadcast, AQE | partition pruning, perf tuning at scale, custom UDFs |
| Apache Kafka | topics + partitions | consumer groups, acks | EOS, transactions, KRaft, zombie fencing |
| Airflow | write a DAG | sensors, retries, backfills | deferrable operators, custom executors, secret backends |
| Cloud | console clickops | IAM least privilege, S3/GCS, FaaS | Terraform, multi-region, cost ownership |
| Warehouses | write a SELECT | clustering, partitioning, cost | warehouse-level cost optimisation, governance |
| dbt | model + ref | tests, sources, incremental | macros, exposures, state:modified+ in CI |
| System design | one-box pipeline | trade-off naming | failure-mode + cost in the answer |
| Stakeholder management | take a ticket | translate ask → spec | manage expectations + drive priorities |
| On-call discipline | follow a runbook | tune alerts, write runbooks | own the rotation, drive paging budget |
| Communication | Slack thread | 1-page summary | 2-page RFC + open-comment sessions |
| Mentoring | help when asked | onboard a junior in 30 days | grow a team; codify the curriculum |
| Prioritisation | finish the sprint | scope the quarter | say no with OKR mapping; defend scope |
How to use the grid.
- Self-rate each row Junior / Mid / Senior against where you actually operate today.
- Star three rows where you're lower than the level you're interviewing for.
- For each starred row, pick one practice loop from PipeCode and run it weekly.
- Re-rate every 8 weeks; you should see at least 2 rows shift by one level per quarter.
Frequently asked questions
How many programming languages should a data engineer know?
One language cold — usually Python — plus SQL as a first-class language, plus enough shell/bash to be dangerous on a Linux box. A second language (Scala or Java) is a strong "nice to have" for JVM-stack shops (Spark on Scala, Kafka Streams in Java) but not a gate for most 2026 DE roles. Don't chase a third language unless your target team uses it daily; depth beats breadth in the loop.
Is Java required for data engineering?
Not for most DE roles in 2026. Spark on Python (PySpark) covers the majority of distributed-compute work; Kafka clients in Python (confluent-kafka) cover most messaging work. Java becomes important if you want to write Kafka Streams applications, contribute to Apache projects, or work at a Spark-on-Scala shop like Netflix or Databricks. If you have to pick one, pick Python and reach for Scala or Java when the team you're joining uses them.
How important is system design for DE interviews?
Very — system design is the seam round that tests every other skill at once. Most senior+ loops include one 45–60 minute system-design round; staff+ loops include two. The probe rewards trade-off reasoning over right answers — clarify the prompt, sketch the pipeline, name two alternatives, draw the failure modes, and put a back-of-envelope cost on it. A weak system-design round will tank an otherwise strong loop.
Do I need to know machine learning for DE?
Not deeply. The bar is "you can ingest features, store them with low-latency lookup, schedule retraining, and not lose track of model lineage" — i.e. you understand the data side of ML platforms. You do not need to derive gradient descent or tune transformer hyper-parameters. If your target team is ML-infrastructure-heavy (Uber's Michelangelo, Spotify's Hendrix), spend 2–3 weeks on feature stores, model registries, and online inference patterns — not on the math.
Are soft skills tested in DE interviews?
Yes, in every senior+ loop. The behavioural round is typically 45–60 minutes of STAR questions across the five soft skills — stakeholder management, on-call discipline, communication, mentoring, and prioritisation. Bar-raisers and hiring managers weigh this round at least as heavily as the technical rounds. Prepare 8–12 STAR stories, each ~2 minutes long, that you can map to any of the five themes; the muscle is in the mapping, not the memorising.
Which skill should I learn next if I already know SQL + Python?
In order: (1) one warehouse cold (Snowflake or BigQuery — clustering, partitioning, cost) → (2) dbt — models, refs, sources, tests, incremental → (3) Airflow — DAGs, sensors, deferrable operators → (4) Spark — DataFrames, shuffle, broadcast joins. After that pick one of Kafka (if your target team is streaming-heavy) or system design (if your target loop has a design round). Resist the urge to start three at once; finish one before opening the next.
Practice on PipeCode
- Drill the SQL window-function library → — the single highest-leverage Tier-1 skill.
- Warm up on Python data-structure problems → to keep the language muscle warm.
- Practise end-to-end ETL drills → to glue SQL + Python + scheduling into one shape.
- Sharpen streaming pipeline problems → for the Kafka and Spark Structured Streaming muscle.
- Run DE system-design drills → — the seam skill that gates senior+ loops.
- Read top data engineering interview questions → for the broader interview shape.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Build the SQL foundation with SQL for data engineering interviews →.
- Build the Python foundation with Python for data engineering interviews →.
- Build the Spark depth with Apache Spark internals for DE interviews →.
- Build the design muscle with ETL system design for DE interviews →.
- Build the modelling muscle with data modelling for DE interviews →.
- Build the behavioural muscle with behavioural interview prep for DEs →.
- Build the product-sense muscle with product sense and metrics for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every technical skill in the 10-skill grid above ships with a hands-on practice room where you write the SQL, fix the Spark plan, or trace the Kafka rebalance on real data. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, plus the behavioural course bundle for the soft 5.
Practice SQL window functions now →
DE system-design drills →





Top comments (0)