DEV Community

Cover image for Data Engineering Skills: 10 Technical + 5 Soft Skills Hiring Managers Test
Gowtham Potureddi
Gowtham Potureddi

Posted on

Data Engineering Skills: 10 Technical + 5 Soft Skills Hiring Managers Test

data engineering skills in 2026 are not the random "Python + SQL + a cloud" laundry list every job board prints. Hiring managers at FAANG and streaming-heavy shops have converged on a tight, testable shape: ten technical skills the loop probes with code on a whiteboard, and five soft skills the behavioural round measures with STAR questions. Miss any of the ten on a phone screen and the loop ends; miss any of the five in the loop and the offer stalls at the bar-raiser.

This guide unpacks the exact ten data engineer skills that gate every modern DE loop — SQL, Python, dimensional modelling, Spark, Kafka, Airflow, cloud, warehouses, dbt, and system design — and the five data engineer soft skills that gate promotion past mid-level — stakeholder management, on-call discipline, communication, mentoring, and prioritisation. Each section pairs a 30-second self-test you can run in your kitchen with a code snippet or a STAR script and a concept-by-concept breakdown of why interviewers reward that exact answer shape. By the end you'll have a clean skills for data engineer checklist, a self-rating grid, and a study plan that treats data engineering technical skills and behavioural muscles as one ordered ladder, not two separate worlds.

PipeCode blog header for a guide to 10 technical + 5 soft data engineering skills — bold white headline 'DE Skills · 10 Technical + 5 Soft' with subtitle 'SQL · Python · Spark · Kafka · Cloud · Stakeholders · On-call' and a stylised skills-pyramid infographic with 3 ascending layers on a dark gradient with purple, green, orange, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the SQL window-function library →, warm up on Python data-structure problems →, and rehearse end-to-end ETL drills →.


On this page


1. Why "DE skills" means 10 technical + 5 soft

The 2026 job description has converged on a 15-skill shape — 10 technical + 5 behavioural

Scrape 200 senior data engineering job descriptions from FAANG, payments, and streaming shops in 2026 and the same 15 skills appear in 80% of them. Ten are technical bullets with concrete tools attached ("must have SQL + Python + Spark at production scale, dbt + Airflow + one warehouse"). Five are behavioural bullets phrased as outcomes ("partner with product on metric definitions", "own the on-call rotation for the data platform"). The hiring loop maps cleanly onto the two groups — coding rounds test the technical ten, behavioural rounds test the soft five, system-design tests the seam between them.

The job-description average — what 80% of 2026 DE listings ask for.

  • 12 technical bullets typically — the 10 we cover here, plus two team-specific extras (CDC tools, ML pipelines, real-time analytics, embeddings, depending on the team).
  • 4 soft-skill bullets typically — usually framed as "partnering with product", "writing design docs", "mentoring juniors", "owning incidents."
  • One "nice to have" list that often includes Scala, Rust, Go, Flink, Iceberg, Delta Lake — these signal the team's leading edge, not the bar.

The T-shape model — go deep on 2–3 things, stay broad on the rest.

  • The vertical bar of the T is depth — two or three skills you can debug a production incident with at 3am. Most senior DEs go deep on one of {SQL + warehouses}, one of {Spark + Python}, and one of {Kafka + streaming}.
  • The horizontal bar is breadth — you know the shape of every other tool well enough to read a config, ask the right question, and not slow the team down.
  • Juniors fall into the "knows everything" trap — three weeks of every tool, zero hours of incident debugging on any of them. Recruiters can smell this on a resume.

Skills vs experience — skills get the interview, soft skills get the offer.

  • Skills decide whether your resume passes the recruiter screen and whether your phone screen passes. They are necessary but not sufficient.
  • Soft skills decide whether the bar-raiser writes "hire" or "no hire" on the debrief. Two candidates with identical technicals — the one with cleaner stakeholder stories gets the offer.
  • The seam is system design. It tests both at once — can you reason about trade-offs (technical) and defend a choice in front of a sceptical peer (soft).

What the hiring loop is actually measuring.

  • Phone screen. SQL fluency, basic Python, one design probe. Tier-1 technical only.
  • On-site coding. Spark / dbt / Airflow knobs, Kafka semantics, dimensional modelling. Tier-2 technical.
  • System design. One end-to-end pipeline; trade-offs, failure modes, cost. Bridge.
  • Behavioural. STAR stories on stakeholder mgmt, on-call, communication, mentoring, prioritisation. Soft 5.
  • Bar raiser. Cross-checks all of the above; weight on the behaviourals at senior+.

Worked example — self-test: list your top 3 skills in 60 seconds

Detailed explanation. A working DE should be able to name their top three skills, top three gaps, and one behavioural story without warming up. If you can't, your skills inventory is fuzzy and you'll be fuzzy in the loop too.

Question. In 60 seconds, name (a) your three deepest technical skills with one concrete production project for each, (b) your three weakest technical skills you would flag to a hiring manager, and (c) one stakeholder-management story in STAR shape. Time yourself.

Code (a notebook prompt you can run on yourself).

60-second skill inventory — write the answer; don't think.

(a) Top 3 technical skills + a 1-sentence production proof:
  1. _______________________  — proof: __________________________
  2. _______________________  — proof: __________________________
  3. _______________________  — proof: __________________________

(b) 3 weakest technical skills (would honestly flag in an interview):
  1. _______________________
  2. _______________________
  3. _______________________

(c) Stakeholder STAR (30 seconds spoken):
  Situation: _______________________
  Task:      _______________________
  Action:    _______________________
  Result:    _______________________
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Top three reveal where you actually go deep. If you can't produce a concrete project per skill, the skill is on your resume but not in your hands. Recruiters will catch this in 10 minutes.
  2. Weakest three reveal self-awareness. The candidate who says "honestly, Spark performance tuning is my weakest — I lean on the senior engineer for plan inspection" is rated higher than the candidate who claims 9/10 on everything.
  3. STAR story reveals communication. A clean 30-second STAR ("we needed an attribution metric → I wrote the spec with marketing → shipped a v0 in 5 days → reduced their reporting lag from 24h to 4h") signals senior. A 5-minute rambling story signals junior.
  4. Why 60 seconds? Phone screens compress this exact question into the first 90 seconds. Practising under pressure beats practising at leisure.

Output.

Question Senior answer shape Junior answer shape
Top 3 skills + proof tool + production project per skill tool only, no proof
Weakest 3 named honestly with a coping strategy "I'm good at everything"
STAR story 30s, 4 beats, clean result number 3 minutes, no result

Rule of thumb. If you can't name your top 3 + bottom 3 + one STAR in 60 seconds, you're not ready for a senior phone screen. Practise this drill weekly until it's automatic.

DE-skills interview question on prioritising what to study next

A common framing on the coffee chat before the on-site is: "Walk me through how you'd decide what to study next given a four-week prep window." It tests whether the candidate has a model for de skills triage, not just a wishlist.

Solution Using the gap × frequency × difficulty matrix

# study_priority.py — score each skill, sort, study the top 3
SKILLS = {
    # skill: (self_rating_1_5, interview_frequency_1_5, recovery_difficulty_1_5)
    "SQL window functions":      (4, 5, 2),
    "Python pandas":             (5, 5, 1),
    "Dimensional modelling":     (3, 4, 3),
    "Spark performance tuning":  (2, 4, 5),
    "Kafka EOS":                 (2, 3, 4),
    "Airflow operators":         (4, 4, 2),
    "Snowflake clustering":      (3, 3, 3),
    "dbt tests + macros":        (4, 4, 2),
    "System design":             (3, 5, 5),
    "Stakeholder STAR stories":  (3, 5, 3),
}

def score(rating, freq, diff):
    # higher = study sooner; weight frequency and difficulty over self-rating
    return (6 - rating) * freq * diff

ranked = sorted(SKILLS.items(), key=lambda kv: -score(*kv[1]))
for skill, (r, f, d) in ranked[:5]:
    print(f"{skill:<32s} priority={score(r, f, d)}  rating={r}/5  freq={f}/5  diff={d}/5")
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Skill rating freq diff (6-rating)*freq*diff
Spark performance tuning 2 4 5 80
System design 3 5 5 75
Kafka EOS 2 3 4 48
Stakeholder STAR stories 3 5 3 45
Dimensional modelling 3 4 3 36

Output:

Rank Skill Why it leads
1 Spark performance tuning low rating × high freq × hardest to recover in interview
2 System design medium rating × highest freq × hardest to learn from scratch
3 Kafka EOS low rating × medium freq × hard to fake
4 Stakeholder STAR stories medium rating × highest freq × medium difficulty
5 Dimensional modelling medium-low rating × high freq × medium difficulty

Why this works — concept by concept:

  • Self-rating inversion6 - rating turns "low rating = high need." A 5/5 contributes 1; a 1/5 contributes 5.
  • Frequency weighting — multiplying by interview frequency keeps you from studying things that won't be asked.
  • Difficulty weighting — Spark perf tuning takes weeks of effort; SQL WHERE takes hours. The score correctly biases toward the multi-week skills.
  • Top-3 discipline — only the top 3 enter the four-week plan. A 15-skill plan is no plan.
  • Cost — the scoring function is O(N) per skill; building the inventory is the work, not the math.

SQL
Topic — window functions
Window-function drills (SQL)

Practice →


2. The DE skills pyramid — foundational → advanced → senior

Every senior DE skill assumes the layer below — skip a layer, fail an interview

The mental model in one line: DE skills stack as a pyramid where Layer 0 (SQL + Python + Git + Linux) holds up everything; Layer 1 adds the modern stack (one warehouse + dbt + Airflow + cloud basics); Layer 2 adds distributed compute and streaming (Spark + Kafka + system design); Layer 3 adds platform thinking (on-call ownership + mentoring + stakeholder management). Each layer assumes the one below. A candidate who can recite Spark internals but flounders on a recursive CTE is failing Layer 0 — the loop ends.

Visual 4-layer DE skills pyramid — bottom to top: Foundational (SQL, Python, Git, Linux), Intermediate (one warehouse + dbt + Airflow + cloud basics), Advanced (Spark, Kafka, streaming, system design), Senior (platform thinking, on-call ownership, mentoring, stakeholder management); each layer with example skill pills; on a light PipeCode card.

Layer 0 — foundational (Months 0–6 of a DE career).

  • SQL. Joins, GROUP BY, CTEs, window functions, basic query plans. Tested in 100% of DE loops.
  • Python. Data structures, iterators, comprehensions, file I/O, basic OOP. Tested in 80% of DE loops.
  • Git. Branch, rebase, merge, conflict resolve. Not tested in coding but expected on day one.
  • Linux. bash, grep, awk, find, file permissions, cron. Often surfaces in system-design "what would you do on the box" probes.

Layer 1 — intermediate (Months 6–24).

  • One warehouse. Snowflake or BigQuery or Redshift — pick one and own it cold. Clustering, partitioning, micro-partitions, cost knobs.
  • dbt. Models, refs, sources, tests, macros, snapshots. The default analytics-engineering layer.
  • Airflow. DAGs, operators, sensors, scheduling, retries, backfills.
  • Cloud basics. IAM, S3/GCS, managed services (Glue, Dataflow, EMR). Pick AWS / GCP / Azure; the others are similar.

Layer 2 — advanced (Years 2–4).

  • Spark. DataFrames, lazy evaluation, shuffle, partitioning, broadcast joins, AQE.
  • Kafka. Topics, partitions, consumer groups, EOS, transactions, KRaft.
  • Streaming. Watermarks, windowing, late events, stream-table joins (Streams / Flink).
  • System design. Lambda vs Kappa, lakehouse, CDC, batch vs streaming, idempotence.

Layer 3 — senior (Years 4+).

  • Platform thinking. Build for the next 10 teams, not just your own. Self-serve interfaces, contracts, golden paths.
  • On-call ownership. Runbooks, alerting hygiene, postmortems, paging budgets.
  • Mentoring. Onboard juniors in <30 days; code reviews that teach.
  • Stakeholder management. Translate business asks → SQL specs; manage expectations; defend scope.

The "each layer assumes the one below" rule.

  • You cannot tune Spark shuffle without understanding partitioning, which depends on knowing what a join key is, which depends on Layer-0 SQL.
  • You cannot defend a Kafka topic design without understanding throughput math, which depends on Python iterators + dict semantics, which depends on Layer-0 Python.
  • You cannot mentor a junior on dbt model layout without first knowing dbt cold yourself (Layer 1).
  • The interview probes downward — when a candidate fluffs a Layer-2 Spark question, the interviewer drops to Layer 1 (warehouse) and Layer 0 (SQL) to find where the foundation cracks. Don't have a cracked foundation.

Worked example — diagnose where a candidate's pyramid is broken

Detailed explanation. A common interviewer drill is: given a partial answer to a Spark question, decide which layer of the pyramid is weak and re-pitch the probe. The same mental exercise works in reverse on yourself.

Question. A candidate is asked to explain Spark repartition() vs coalesce(). They mumble about "shuffles" but get the direction wrong. Diagnose which layer of their pyramid is weak and design a follow-up probe.

Code (the probe ladder).

# probe_ladder.py — interviewer's mental script
def diagnose(spark_answer_quality):
    if spark_answer_quality == "wrong direction":
        # they don't know what a shuffle is
        return "drop to Layer 1: 'walk me through what happens when you GROUP BY in Spark'"
    if spark_answer_quality == "right direction, no why":
        # they know the rule but not the cost model
        return "drop to Layer 0: 'show me how you'd write that GROUP BY in SQL'"
    if spark_answer_quality == "full explanation":
        # they're solid; probe upward
        return "climb to Layer 2: 'how would you tune the partition count for a 2TB join?'"
    return "ambiguous — ask for a concrete production example"

print(diagnose("wrong direction"))
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The interviewer hears "shuffles" but the direction is wrong (coalesce triggers a shuffle in their answer; it doesn't unless going from many to fewer with shuffle=True).
  2. That signals shaky Layer 2 — Spark internals. The interviewer drops to Layer 1.
  3. Layer 1 probe: "What happens when you GROUP BY in Spark?" If the candidate can't describe the wide transformation, the issue is even deeper.
  4. Layer 0 probe: "Write me a GROUP BY in plain SQL." If that fails, the foundation is cracked and no amount of Spark trivia will save the loop.
  5. The lesson for the candidate — when you study Spark, don't memorise the API. Trace it down to the SQL it would compile to. The pyramid stays whole.

Output.

Candidate answer Diagnosis Next probe
"shuffles" wrong direction shaky Layer 2 drop to Layer 1 (warehouse GROUP BY)
right direction, no cost reason shaky Layer 1 drop to Layer 0 (SQL GROUP BY)
full explanation with cost solid through Layer 2 climb to Layer 3 (platform-scale tuning)

Rule of thumb. When self-studying, the moment you find yourself memorising a Spark/Kafka API without being able to trace it down to a SQL or Python primitive, stop and fix the layer below. Every interviewer probes downward.

DE-skills interview question on layer prioritisation under time pressure

A senior interviewer might ask: "You have 4 weeks to ramp a junior on your team — how do you sequence the pyramid?" — testing whether the candidate has internalised the dependency order.

Solution Using strict bottom-up sequencing with weekly milestones

# 4_week_ramp.py — the only valid teaching order
PLAN = [
    ("Week 1", "Layer 0: SQL — joins + window functions + EXPLAIN"),
    ("Week 1", "Layer 0: Python — pandas + asyncio + pytest basics"),
    ("Week 2", "Layer 1: one warehouse — Snowflake clustering + cost model"),
    ("Week 2", "Layer 1: dbt models + refs + tests"),
    ("Week 3", "Layer 1: Airflow DAGs + retries + sensors"),
    ("Week 3", "Layer 2: Spark — DataFrames + shuffle + broadcast"),
    ("Week 4", "Layer 2: Kafka — topics + partitions + consumer groups"),
    ("Week 4", "Layer 2: one system-design pipeline end-to-end (CDC → warehouse)"),
]
for week, milestone in PLAN:
    print(f"{week}: {milestone}")
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Week Layer Skill Why this order
1 0 SQL + Python every later layer depends on these
2 1 Warehouse + dbt the daily working surface
3 1 + 2 Airflow + Spark scheduler first, then compute
4 2 Kafka + system design last because they assume all of the above

Output:

Layer Time allocated Outcome at end of week
0 (SQL + Python) 1 week can write a window-function query + a pytest-tested pandas script
1 (warehouse + dbt + Airflow) 1.5 weeks can ship a dbt model to a scheduled DAG
2 (Spark + Kafka + design) 1.5 weeks can defend a 3-stage pipeline at a whiteboard

Why this works — concept by concept:

  • Bottom-up sequencing — Layer 0 is non-negotiable in week 1; you can't dbt-model a query you can't read.
  • Dual-track weeks — pair a foundational skill with a layer-1 application so the learning sticks (SQL with a Snowflake query; Python with a pandas pipeline).
  • One end-to-end project — week 4 forces the layers to connect; without it the junior learns trivia, not a system.
  • No Layer 3 in 4 weeks — platform thinking and on-call ownership come from running real incidents, not from a curriculum.
  • Cost — 4 weeks × 40 hours = 160 hours; covers Layers 0–2 in skim form. Mastery of any one Layer 2 skill takes another 8–12 weeks.

SQL
Topic — joins
SQL joins drills (foundational)

Practice →


3. Tier 1 technical — SQL, Python, dimensional modelling

Tier 1 is the gatekeeper — fail SQL on the phone screen and the loop ends

These three are the non-negotiable Tier-1 data engineering technical skills. SQL gets tested in 100% of DE phone screens. Python gets tested in 80%. Dimensional modelling gets tested in 60% of senior+ loops — and in 100% of analytics-engineer loops. If you have only 80 hours this quarter, spend 60 of them here.

Visual diagram of the Tier 1 DE technical skills — three side-by-side panels (SQL, Python, Dimensional modeling), each with a tiny code-shape icon, a sub-skill checklist, and a 'tested in every interview' chip; a small annotation about over/under-indexing in juniors; on a light PipeCode card.

The three Tier-1 skills in one frame.

  • SQL — the lingua franca of every warehouse, lake, and dbt model. Joins, windows, CTEs, query plans.
  • Python — the scripting glue of every DAG, every API client, every test. pandas, requests, SQLAlchemy, asyncio, pytest.
  • Dimensional modelling — the design language of every warehouse layer. Fact / dim / grain / SCDs.

Common gap — juniors over-index on Python, under-index on SQL.

  • Python tutorials are abundant; SQL feels "old." Result: junior DEs can write a 200-line pandas pipeline but can't write a RANK() OVER (PARTITION BY) from memory.
  • Fix the imbalance first. Every job at every warehouse-first shop is SQL-first. A LEFT JOIN typed wrong in the phone screen kills 60% of loops.

SQL — joins, window functions, CTEs, query plans

Detailed explanation. Senior-DE SQL is not "SELECT * FROM users." It is window functions for ranking and gap detection, recursive CTEs for hierarchies, anti-joins for "find what's missing," and the ability to read an EXPLAIN plan to spot a missing index or a runaway nested loop.

Code (the four SQL primitives every DE must own).

-- 1. Window function — running total per user
SELECT
  user_id,
  order_date,
  amount,
  SUM(amount) OVER (
    PARTITION BY user_id ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM orders;

-- 2. Recursive CTE — manager hierarchy
WITH RECURSIVE org AS (
  SELECT employee_id, manager_id, 1 AS depth
  FROM employees WHERE manager_id IS NULL
  UNION ALL
  SELECT e.employee_id, e.manager_id, org.depth + 1
  FROM employees e
  JOIN org ON e.manager_id = org.employee_id
)
SELECT * FROM org;

-- 3. Anti-join — users who never placed an order
SELECT u.user_id
FROM users u
LEFT JOIN orders o ON o.user_id = u.user_id
WHERE o.user_id IS NULL;

-- 4. EXPLAIN — confirm the planner uses the right index
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 42 AND order_date >= '2026-01-01';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Window functionPARTITION BY user_id ORDER BY order_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is the canonical running-total frame. Senior interviewers expect the explicit ROWS BETWEEN clause; juniors omit it.
  2. Recursive CTE — the seed query selects the roots (top-level managers); the recursive part joins back. Every modern warehouse supports the standard syntax.
  3. Anti-joinLEFT JOIN ... WHERE IS NULL is the most-asked SQL pattern in the loop because it tests whether the candidate understands join semantics beyond INNER. The alternative NOT EXISTS is also acceptable.
  4. EXPLAIN — the buffered ANALYZE shows actual rows + actual time per node. The discriminator between mid and senior is whether the candidate looks at the plan first or guesses at fixes first.

Output.

SQL skill Coverage in interviews Min reps to feel fluent
Joins (INNER/LEFT/RIGHT/FULL/CROSS) 100% 50+
Window functions 95% 30+
CTEs (simple + recursive) 80% 20+
Anti-joins 75% 15+
EXPLAIN plans 60% 10+

Common beginner mistakes.

  • Writing RANK() when DENSE_RANK() was needed (or vice versa). Practice both.
  • Forgetting the explicit window frame and relying on the default (which differs across engines).
  • Reaching for a subquery when a CTE would read better; or vice versa.
  • Treating EXPLAIN as scary instead of a friend.

Python — pandas, requests, SQLAlchemy, asyncio, pytest

Detailed explanation. Senior-DE Python is not list comprehensions. It is pandas for the 80% of pipelines that fit in memory, requests for every API integration, SQLAlchemy for typed connections, asyncio for IO-bound concurrency, and pytest (with fixtures) for the test pyramid every reviewer expects.

Code (idiomatic Python every DE should be able to write cold).

# pipeline.py — a realistic DE script in 30 lines
import asyncio
import pandas as pd
import requests
import pytest
from sqlalchemy import create_engine

async def fetch_orders(session, day):
    r = requests.get(f"https://api.example.com/orders?day={day}", timeout=10)
    r.raise_for_status()
    return r.json()

async def main(days):
    tasks = [fetch_orders(None, d) for d in days]
    raw = await asyncio.gather(*tasks)
    df = pd.DataFrame([row for batch in raw for row in batch])
    df["order_date"] = pd.to_datetime(df["order_date"])
    df = df.groupby(["user_id", df["order_date"].dt.date]).agg(
        orders=("order_id", "count"),
        revenue=("amount", "sum"),
    ).reset_index()
    engine = create_engine("postgresql+psycopg2://localhost/warehouse")
    df.to_sql("daily_user_revenue", engine, if_exists="append", index=False)

# tests/test_pipeline.py
@pytest.fixture
def sample_df():
    return pd.DataFrame({"user_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]})

def test_revenue_sum(sample_df):
    assert sample_df.groupby("user_id")["amount"].sum().loc[1] == 30.0
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. asyncio.gather — fans out N HTTP calls concurrently. For IO-bound work this is 5–20× faster than serial calls.
  2. pd.DataFrame([row for batch in raw for row in batch]) — flattens a list of lists into a tidy DataFrame. Idiomatic.
  3. groupby + agg with named tuples — readable aggregations; the senior shape.
  4. SQLAlchemy.create_engine — typed connection pooling; never raw psycopg2 for production scripts.
  5. pytest fixture — DataFrame in, assertion out. The minimum bar for a DE pull-request.

Output.

Python primitive Use case Senior signal
asyncio.gather IO-bound concurrency knows the difference vs threads
pandas.groupby.agg aggregations uses named tuples, not .rename(columns=...)
SQLAlchemy.create_engine DB access uses connection pooling, not raw cursors
pytest.fixture unit tests every PR ships at least one test
type hints every function signature uses from __future__ import annotations

Common beginner mistakes.

  • Iterating row-by-row with df.iterrows() instead of vectorising. Slow and un-idiomatic.
  • Using requests in a synchronous loop when asyncio would fit; or worse, using asyncio when requests in a thread pool would be simpler.
  • Skipping pytest because "it's a script." Every script is a candidate for review.

Dimensional modelling — fact / dim / grain / SCDs

Detailed explanation. Dimensional modelling is the Kimball language of the warehouse. Facts are the numeric measures (orders, clicks, revenue); dimensions are the descriptive attributes (user, product, date). Grain is the lowest level of detail in the fact table — "one row per order line per minute." SCDs (Slowly Changing Dimensions) capture how dimensions change over time without losing history.

Code (a star-schema DDL sketch + a SCD type-2 merge).

-- Fact + dimensions (star schema)
CREATE TABLE dim_user (
  user_sk        BIGINT PRIMARY KEY,        -- surrogate key
  user_id        BIGINT NOT NULL,           -- natural key
  email          TEXT,
  country        TEXT,
  effective_from TIMESTAMP NOT NULL,
  effective_to   TIMESTAMP,                 -- NULL = currently active
  is_current     BOOLEAN NOT NULL
);

CREATE TABLE fact_orders (
  order_sk       BIGINT PRIMARY KEY,
  user_sk        BIGINT REFERENCES dim_user(user_sk),
  product_sk     BIGINT REFERENCES dim_product(product_sk),
  order_ts       TIMESTAMP NOT NULL,
  amount         NUMERIC(12, 2) NOT NULL,
  quantity       INTEGER NOT NULL
);

-- SCD type-2 merge — close the old row, open a new one
MERGE INTO dim_user AS tgt
USING staged_users    AS src
ON tgt.user_id = src.user_id AND tgt.is_current = TRUE
WHEN MATCHED AND (tgt.email <> src.email OR tgt.country <> src.country) THEN
  UPDATE SET effective_to = src.event_ts, is_current = FALSE;

INSERT INTO dim_user (user_sk, user_id, email, country, effective_from, effective_to, is_current)
SELECT nextval('user_sk_seq'), src.user_id, src.email, src.country,
       src.event_ts, NULL, TRUE
FROM staged_users src
LEFT JOIN dim_user d
  ON d.user_id = src.user_id AND d.is_current = TRUE
WHERE d.user_id IS NULL OR d.email <> src.email OR d.country <> src.country;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Surrogate key (user_sk) separates warehouse identity from source identity (user_id); lets SCD type-2 store many versions per natural key.
  2. fact_orders.user_sk points at the version of the user that was current at order_ts — preserves the "as it was then" view forever.
  3. SCD type-2 merge closes the prior row (effective_to = src.event_ts, is_current = FALSE) and inserts a fresh row with the new attributes.
  4. is_current boolean is a denormalised convenience for downstream queries; it must be flipped atomically with the new insert.
  5. Type-1, type-3, type-6 are the other common SCDs — type 1 overwrites; type 3 keeps prior value in a side column; type 6 combines 1+2+3.

Output.

Dimension table state at t=10 user_sk user_id email effective_from effective_to is_current
original 1001 42 a@example.com 2026-01-01 NULL true
After email change at t=20 user_sk user_id email effective_from effective_to is_current
closed 1001 42 a@example.com 2026-01-01 2026-02-15 false
new 1002 42 new@example.com 2026-02-15 NULL true

Common beginner mistakes.

  • Using the natural key as the foreign key in the fact table — destroys SCD history.
  • Forgetting is_current = TRUE in the merge ON — produces duplicate active rows.
  • Modelling at "one row per customer" when the grain is actually "one row per order" — flattens the truth.

DE-skills interview question on a window-function ranking

A common phone-screen probe: "Top 3 products by revenue per category in the last 30 days — write the SQL." Tests joins + filtering + window functions + LIMIT/QUALIFY semantics.

Solution Using ROW_NUMBER() OVER (PARTITION BY category) + QUALIFY

WITH last_30 AS (
  SELECT
    p.category,
    p.product_id,
    SUM(o.amount) AS revenue
  FROM orders o
  JOIN products p ON p.product_id = o.product_id
  WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY p.category, p.product_id
)
SELECT
  category,
  product_id,
  revenue,
  ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk
FROM last_30
QUALIFY rnk <= 3
ORDER BY category, rnk;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Stage Row count Operation Result
1 1,000,000 orders filter order_date >= now() - 30d 80,000 orders
2 80,000 orders join products on product_id 80,000 enriched rows
3 80,000 enriched GROUP BY category, product_id + SUM(amount) 2,500 (category, product) rows
4 2,500 rows ROW_NUMBER() PARTITION BY category ORDER BY revenue DESC 2,500 rows with rnk
5 2,500 rows QUALIFY rnk <= 3 3 × number of categories rows

Output:

category product_id revenue rnk
books 71 9,400 1
books 88 7,150 2
books 23 6,800 3
electronics 105 24,300 1
electronics 142 21,000 2
electronics 198 18,400 3

Why this works — concept by concept:

  • CTE for the aggregate — the last_30 CTE keeps the window function operating on aggregated revenue, not raw orders. Reads top-to-bottom.
  • ROW_NUMBER vs RANK vs DENSE_RANKROW_NUMBER is the right pick for "top 3 with no ties"; RANK would skip ranks after a tie; DENSE_RANK wouldn't skip.
  • PARTITION BY category — restarts the numbering per category, giving "top 3 per category" instead of "top 3 overall."
  • QUALIFY — Snowflake/BigQuery filter on a window function in one pass; in Postgres you'd wrap in another CTE and filter on rnk.
  • Cost — O(N log N) for the sort inside the window; O(N) for the aggregate; small constant on top from QUALIFY. Single pass on the aggregated set.

SQL
Topic — window functions
Top-N per group with window functions

Practice →

Python
Topic — data structures
Python data-structure drills

Practice →

SQL
Topic — dimensional modelling
Dimensional modelling problems

Practice →


4. Modern-stack technical — Spark, Kafka, Airflow, cloud, warehouses, dbt, system design

Pick a stack and go deep — "stack-shopping" wastes interview prep time

The modern-stack layer is where skills for data engineer roles diverge from analytics-engineer roles. Spark and Kafka are the distributed-compute backbone; Airflow (or Dagster / Prefect) is the scheduler; one cloud + one warehouse + dbt round out the typical 2026 stack. System design is the seam that tests whether you can hold all of them in your head at once.

Visual diagram of the modern DE stack — six tool cards arranged in a 2x3 grid (Spark, Kafka, Airflow, Cloud, Warehouses, dbt), each with a tiny icon, 2-3 key concept pills, and a one-line 'why interview tests this'; on a light PipeCode card.

The 2026 stack-shopping trap.

  • Juniors with three weeks of every tool look weak on every tool. Recruiters can smell it on the resume.
  • Pick one tool per layer (e.g. AWS + Snowflake + Airflow + dbt + Spark + Kafka) and run a real project end-to-end. Real means a deployed pipeline a real consumer reads.
  • If you don't know what stack to pick: AWS + Snowflake + Airflow + dbt + Spark + Kafka is the safest 2026 choice for FAANG-adjacent loops.

Apache Spark — DataFrames, lazy evaluation, shuffle, partitioning

Detailed explanation. Spark is the distributed-compute spine of the lakehouse. The model: DataFrames are lazy, every transformation builds a logical plan, actions trigger a physical plan, and shuffles (wide transformations like groupBy, join, repartition) are the dominant cost. The senior probe is always partitioning + shuffle + broadcast joins.

Code (the canonical Spark idiom: avoid the shuffle).

# orders_revenue.py — PySpark
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily_revenue").getOrCreate()

orders   = spark.read.parquet("s3://datalake/orders/")           # 2 TB, partitioned by order_date
products = spark.read.parquet("s3://datalake/products/")         # 50 MB, small enough to broadcast

# 1. Predicate pushdown — only read the last day's partition
orders_today = orders.where(F.col("order_date") == "2026-05-30")

# 2. Broadcast join — small side replicated to every executor; no shuffle of the big side
enriched = orders_today.join(
    F.broadcast(products), on="product_id", how="inner"
)

# 3. groupBy is the unavoidable wide transformation
daily = (
    enriched
    .groupBy("category")
    .agg(F.sum("amount").alias("revenue"), F.count("*").alias("orders"))
)

daily.write.mode("overwrite").parquet("s3://warehouse/daily_revenue/")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Predicate pushdown.where(F.col("order_date") == "2026-05-30") filters at the Parquet level, scanning only one partition's files instead of the whole 2TB.
  2. Broadcast joinF.broadcast(products) ships the 50MB dim to every executor; the 2TB fact is read once and joined locally. No shuffle of the big side. Saves hours.
  3. groupBy + agg — the genuinely wide transformation. Spark must shuffle records by category so each executor sees all rows for one group. AQE (Adaptive Query Execution, default in Spark 3.x+) tunes the post-shuffle partition count automatically.
  4. Write to Parquet — columnar, splittable, compressible. The downstream warehouse can COPY it straight in.

Output.

Stage Input size Shuffled? Output size
Read + predicate pushdown 2 TB on disk no 50 GB scanned
Broadcast join 50 GB + 50 MB no shuffle of big side 50 GB enriched
GroupBy aggregate 50 GB yes (shuffled by category) 200 MB
Write parquet 200 MB no 200 MB

Common beginner mistakes.

  • Reading the whole table without predicate pushdown.
  • Forgetting F.broadcast() on the small side of a join.
  • Using repartition(N) when coalesce(N) would do (no shuffle needed) — or vice versa.

Apache Kafka — topics, partitions, consumer groups, EOS

Detailed explanation. Kafka is the durable log between every producer and every consumer in a modern streaming pipeline. The four things every interview probes: partitions (unit of parallelism and ordering), consumer groups (partition fan-out), acks=all + min.insync.replicas=2 (durability), and exactly-once semantics (idempotent producer + transactions + read_committed).

Code (a durable producer + an at-least-once consumer).

# producer.py
from confluent_kafka import Producer
producer = Producer({
    "bootstrap.servers": "kafka-1:9092",
    "acks": "all",
    "enable.idempotence": True,
    "compression.type": "zstd",
    "linger.ms": 10,
    "batch.size": 65536,
})
producer.produce("orders", key=b"42", value=b'{"order_id":1001}')
producer.flush()

# consumer.py
from confluent_kafka import Consumer
consumer = Consumer({
    "bootstrap.servers": "kafka-1:9092",
    "group.id": "orders-to-warehouse",
    "enable.auto.commit": False,
    "isolation.level": "read_committed",
    "partition.assignment.strategy": "cooperative-sticky",
})
consumer.subscribe(["orders"])
while True:
    msg = consumer.poll(1.0)
    if msg and not msg.error():
        write_to_warehouse(msg.value())   # side effect first
        consumer.commit(asynchronous=False)  # then commit
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. acks=all + enable.idempotence=true — the strongest durability primitive. The producer waits for every ISR member to ack; PID + sequence number deduplicates retries.
  2. compression.type=zstd — ~3× wire and disk savings.
  3. cooperative-sticky — the default in Kafka 4.x; only the affected partitions revoke on a consumer add/remove.
  4. Manual commit after the side effect — at-least-once semantics; the warehouse needs an idempotent UPSERT to dedup re-deliveries.
  5. isolation.level=read_committed — only see committed transactional records; skip in-flight ones.

Output.

Concept Default in Kafka 4.x Why it matters
Partition assignment cooperative-sticky no stop-the-world rebalances
Producer idempotence true retries don't double-write
Consumer offset commit manual after side effect at-least-once is the safe default
EOS idempotent + transactions + read_committed atomic consume-transform-produce

Common beginner mistakes.

  • Leaving enable.auto.commit=true and then wondering about duplicate writes.
  • Reading acks=all as "slow" when in practice it's still a few-ms RTT.
  • Forgetting transactional.id when running multiple worker instances — no zombie fencing means duplicates after restart.

Apache Airflow — DAGs, operators, sensors, scheduling

Detailed explanation. Airflow (or Dagster / Prefect) is the scheduler that orchestrates DAGs of tasks. The senior probes: idempotent task design, backfills, sensors vs ExternalTaskSensor vs deferrable operators, executor choices (LocalExecutor / CeleryExecutor / KubernetesExecutor), and secret management.

Code (a realistic Airflow 2.x DAG with deferrable sensors).

# dags/daily_revenue.py
from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator
from datetime import datetime, timedelta

with DAG(
    "daily_revenue",
    start_date=datetime(2026, 1, 1),
    schedule="0 2 * * *",
    catchup=True,
    max_active_runs=4,
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:

    wait = S3KeySensor(
        task_id="wait_for_orders",
        bucket_key="orders/{{ ds }}/_SUCCESS",
        bucket_name="datalake",
        deferrable=True,   # frees the worker slot while waiting
        timeout=60 * 60,
    )

    load = SnowflakeOperator(
        task_id="load_daily_revenue",
        sql="""
          INSERT INTO warehouse.daily_revenue
          SELECT category, SUM(amount) AS revenue, COUNT(*) AS orders
          FROM staging.orders
          WHERE order_date = '{{ ds }}'
          GROUP BY category;
        """,
        snowflake_conn_id="snowflake_default",
    )

    wait >> load
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. start_date + catchup=True — backfills automatically from start_date if the DAG falls behind. Idempotent task design makes this safe.
  2. {{ ds }} — the execution date macro; the task uses it to scope the SQL to one day. Re-runs for the same day produce the same output (idempotent).
  3. Deferrable sensor — the S3 sensor releases its worker slot while waiting; the Airflow triggerer handles the polling. Avoids the "sensor starvation" of the old poke mode.
  4. max_active_runs=4 — bounds concurrent DAG runs during a backfill; protects the warehouse from a thundering herd.
  5. retries=3, retry_delay=5min — handles transient warehouse/S3 errors automatically.

Output.

Knob Choice Why
schedule 0 2 * * * (daily at 02:00) downstream warehouse cron lines up
catchup True safe because tasks are idempotent
deferrable True on sensors frees worker slots during waits
max_active_runs 4 backfill without overloading the warehouse
Executor KubernetesExecutor isolated pod per task; right default for 2026

Common beginner mistakes.

  • Non-idempotent tasks (INSERT instead of MERGE / INSERT OVERWRITE).
  • Sensor in poke mode hogging a worker slot for hours.
  • Hard-coded credentials in params instead of Connection + Secrets Backend.

Cloud — AWS, GCP, or Azure (pick one)

Detailed explanation. Every senior DE owns one cloud cold and reads the other two. The 2026 baseline: IAM (least privilege), object storage (S3 / GCS / ADLS), one compute primitive (Lambda / Cloud Functions / Azure Functions), one managed ETL (Glue / Dataflow / Data Factory), and one VPC / networking story.

Code (a typical S3 + IAM least-privilege pattern in Terraform).

# main.tf — least-privilege IAM for a data pipeline
resource "aws_iam_role" "etl" {
  name = "etl-pipeline"
  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Effect = "Allow",
      Principal = { Service = "lambda.amazonaws.com" },
      Action = "sts:AssumeRole"
    }]
  })
}

resource "aws_iam_policy" "etl_rw_one_bucket" {
  name = "etl-rw-one-bucket"
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [{
      Effect = "Allow",
      Action = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      Resource = [
        "arn:aws:s3:::datalake-prod",
        "arn:aws:s3:::datalake-prod/etl/*"
      ]
    }]
  })
}

resource "aws_iam_role_policy_attachment" "attach" {
  role       = aws_iam_role.etl.name
  policy_arn = aws_iam_policy.etl_rw_one_bucket.arn
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. assume_role_policy — only Lambda can assume the role. No wildcard principals.
  2. Inline Action list — the role can GetObject, PutObject, and ListBucket. Not s3:*. Not *.
  3. Resource scoped to one bucket + one prefix — blast radius bounded to the etl prefix of the prod datalake.
  4. No data-plane policy on the prod warehouse role — the warehouse role is a separate identity with no S3 write access.
  5. Every cloud has an equivalent. GCP service account + IAM binding; Azure managed identity + RBAC role assignment.

Output.

Service AWS GCP Azure
Object store S3 GCS ADLS Gen2
Compute (FaaS) Lambda Cloud Functions Azure Functions
Managed ETL Glue Dataflow Data Factory
Identity IAM IAM + service accounts RBAC + Managed Identity
Secret store Secrets Manager Secret Manager Key Vault

Common beginner mistakes.

  • s3:* on * "just to get it working." Permanent prod state.
  • Using static IAM access keys instead of IRSA / Workload Identity / Managed Identity.
  • One IAM role for the whole pipeline instead of per-task least privilege.

Warehouses — Snowflake, BigQuery, Redshift (pick one)

Detailed explanation. The warehouse is the daily working surface for 70% of senior DE work. Snowflake / BigQuery / Redshift differ in clustering vs partitioning semantics, cost model (credits vs slots vs nodes), and SQL flavor — but every senior probe is the same: scan reduction, clustering keys, cost containment.

Code (a Snowflake performance pattern — clustering + result cache).

-- Cluster the fact by the most-used predicate
ALTER TABLE warehouse.fact_orders CLUSTER BY (order_date, user_id);

-- Use the result cache: identical queries within 24h are free
SELECT category, SUM(amount) AS revenue
FROM warehouse.fact_orders
WHERE order_date BETWEEN '2026-05-01' AND '2026-05-30'
GROUP BY category;

-- Warehouse-level cost knob — auto-suspend keeps idle cost near zero
ALTER WAREHOUSE etl_wh SET
  AUTO_SUSPEND = 60,           -- suspend after 60s idle
  AUTO_RESUME = TRUE,
  MIN_CLUSTER_COUNT = 1,
  MAX_CLUSTER_COUNT = 4;       -- scale-out under contention
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. CLUSTER BY (order_date, user_id) — Snowflake reorganises micro-partitions on this key, so range scans on order_date (the most common predicate) skip 90%+ of micro-partitions.
  2. Result cache — the same query within 24h with no underlying-data change is served from cache; you don't pay credits for it.
  3. AUTO_SUSPEND=60 — the warehouse stops billing 60s after idle. Critical for cost.
  4. Multi-cluster warehouse — under contention, additional clusters spin up; under no contention, you pay for one.

Output.

Operation Without clustering With clustering on (order_date, user_id)
WHERE order_date = ... scans full table scans ~1% of micro-partitions
Cost per query $$$ $
Result cache hit reads micro-partitions free

Common beginner mistakes.

  • Clustering on a low-cardinality column ("status").
  • Querying without a predicate on the clustering key.
  • Leaving AUTO_SUSPEND at the default 10 minutes for a sporadically-used warehouse.

dbt — models, refs, sources, tests, macros

Detailed explanation. dbt is the analytics-engineering layer that turns SQL into a tested, documented DAG. Every senior probe: ref() and source() discipline, tests (not_null, unique, relationships, custom singular tests), incremental models, macros, and dbt run --select state:modified+ in CI.

Code (a realistic incremental model with tests).

-- models/marts/fact_orders.sql
{{
  config(
    materialized = "incremental",
    unique_key   = "order_id",
    incremental_strategy = "merge",
    on_schema_change = "append_new_columns"
  )
}}

WITH src AS (
  SELECT *
  FROM {{ source('raw', 'orders') }}
  {% if is_incremental() %}
    WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
  {% endif %}
)
SELECT
  order_id,
  user_id,
  product_id,
  amount,
  order_date,
  updated_at
FROM src;
Enter fullscreen mode Exit fullscreen mode
# models/marts/schema.yml
version: 2
models:
  - name: fact_orders
    description: "Order grain  one row per order, latest version."
    columns:
      - name: order_id
        tests: [not_null, unique]
      - name: user_id
        tests:
          - relationships:
              to: ref('dim_user')
              field: user_id
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. materialized = 'incremental' — dbt builds the table once and merges deltas on subsequent runs; never re-processes the full history.
  2. unique_key = 'order_id' + incremental_strategy = 'merge' — the engine deduplicates on the key; updates flow through.
  3. is_incremental() macro — restricts the source scan to rows newer than the existing max updated_at in the target.
  4. source() — the contract with the raw layer; renaming the raw table breaks the dbt build immediately.
  5. Tests — every PR runs not_null, unique, and relationships on every model. Failures block the merge.

Output.

dbt primitive Effect
ref() enforces DAG ordering + lineage
source() declares raw-layer dependencies
tests block merges on data-quality violations
incremental avoids reprocessing history
state:modified+ CI only builds what changed + its descendants

Common beginner mistakes.

  • Hand-rolling INSERT SQL instead of using incremental materialisation.
  • Skipping the unique + not_null tests on the primary key.
  • Forgetting state:modified+ in CI — slow PR cycles.

System design — the seam that tests everything at once

Detailed explanation. System design is the bridge skill — it requires every Tier-1 and Tier-2 skill at once. The frame: read the prompt → ask 3 clarifying questions → sketch the pipeline → discuss trade-offs → handle failure modes → estimate cost. The probe rewards trade-off reasoning over "right answers."

Code (a system-design checklist you can run mentally in any loop).

SYSTEM DESIGN CHECKLIST — 45-minute round

1. Clarify (3 minutes)
   - What's the volume?  (events/sec, GB/day, rows/year)
   - What's the SLA?     (freshness, downtime budget, query latency)
   - Who's the consumer? (dashboard, ML, ops, regulator)

2. Sketch the pipeline (10 minutes)
   - Source → ingest → storage → transform → serve
   - Mark batch vs streaming on every edge

3. Trade-offs (15 minutes)
   - Lambda vs Kappa (two systems vs one)
   - Idempotence pattern (UPSERT, MERGE, dedupe key)
   - Schema evolution (BACKWARD compat, contracts)
   - Storage format (Parquet, Iceberg, Delta)

4. Failure modes (10 minutes)
   - Source dies for 4 hours — what happens?
   - Schema drift — who breaks first?
   - 10× spike — where does it fall over?
   - Bad data backfilled — how do you recover?

5. Cost estimate (7 minutes)
   - Storage: GB/day × $/GB × retention
   - Compute: hours × $/hour
   - Network egress: GB/day × $/GB
   - Warehouse credits: queries/day × avg credit/query
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Always clarify first. A confident "let me make sure I understand the prompt — 3 things..." is the single highest-signal opening move.
  2. Sketch on the board immediately. Boxes + arrows + a one-word label per box. Don't talk for 10 minutes without drawing.
  3. Trade-off > right answer. "I'd reach for Kappa because we already have Kafka and one fewer system to operate; if the analytical patterns required heavy historical re-processing, I'd switch to Lambda."
  4. Failure modes test maturity. Anyone can draw a happy path; senior DEs draw the 4am page.
  5. Cost closes the round. "At 100k events/sec × 1KB × 30-day retention × 3× replication = 78TB of Kafka storage" — a back-of-envelope number wins points.

Output.

Round phase Senior signal Junior anti-signal
Clarify 3 sharp questions dives into design
Sketch starts drawing in minute 4 talks for 15 minutes
Trade-offs names 2 alternatives + the chosen one "this is the best way"
Failure modes draws the 4am page "shouldn't happen"
Cost back-of-envelope number "I don't know"

Common beginner mistakes.

  • Jumping to "I'd use Kafka and Spark" without clarifying volume.
  • Not drawing on the board.
  • Never naming a trade-off.

DE-skills interview question on choosing batch vs streaming

A frequent on-site probe: "Should this pipeline be batch or streaming?" — testing whether the candidate can map SLA → architecture.

Solution Using the freshness × volume × cost matrix

def batch_or_stream(freshness_sla_minutes, volume_events_per_sec, ops_complexity_budget):
    """Return 'batch', 'streaming', or 'mini-batch' based on the requirement matrix."""
    if freshness_sla_minutes <= 1:
        return "streaming"          # sub-minute SLA → no choice
    if freshness_sla_minutes >= 60 and volume_events_per_sec < 1000:
        return "batch"              # hourly+ SLA, low volume → boring is good
    if freshness_sla_minutes <= 15 and ops_complexity_budget == "low":
        return "mini-batch"         # 5–15 min, but team can't run streaming
    return "streaming"

print(batch_or_stream(freshness_sla_minutes=5, volume_events_per_sec=50_000, ops_complexity_budget="high"))
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Scenario freshness volume ops budget decision
Real-time fraud 1 sec 5,000/s high streaming
Daily revenue dashboard 1 hour 200/s low batch
Marketing attribution 10 min 800/s low mini-batch
Ad bidding 100 ms 50,000/s high streaming

Output:

Pipeline shape When to pick Cost characteristic
Batch (Airflow + warehouse SQL) hourly+ SLA, low complexity tolerance cheapest, simplest
Mini-batch (5–15 min runs) sub-hour SLA, low ops budget middle ground
Streaming (Kafka + Flink/Spark Structured Streaming) sub-minute SLA or high event rates most expensive to run

Why this works — concept by concept:

  • SLA leads — freshness requirement is the only forcing function that takes batch off the table. Everything else is preference.
  • Volume matters second — at very high event rates batch becomes operationally awkward (huge batches that miss SLAs).
  • Ops complexity is the hidden axis — streaming requires on-call, watermarks, late-event handling, schema-evolution discipline. Teams without that maturity should stay in batch as long as the SLA allows.
  • Mini-batch is the polite compromise — Spark Structured Streaming with trigger=AvailableNow gives sub-15-minute latency without a true streaming on-call burden.
  • Cost — batch ≈ 1×; mini-batch ≈ 2×; streaming ≈ 4–6× in cluster + on-call cost.

Streaming
Topic — streaming
Streaming pipeline problems (Kafka, Spark, EOS)

Practice →

Design
Topic — design
DE system-design drills

Practice →

Data
Topic — ETL
End-to-end ETL drills

Practice →


5. Soft skills — stakeholders, on-call, communication, mentoring, prioritisation

The 5 soft skills are tested in behaviourals and gated for senior+ — not optional "people skills"

These five are the data engineer soft skills every modern DE loop probes in the behavioural round and every performance-review cycle measures after the hire. They are not "nice extras." A staff-engineer promotion packet that's strong on the technical 10 but weak on these 5 stalls — every time.

Visual radial wheel diagram of the 5 DE soft skills — five wedges (Stakeholder management, On-call discipline, Communication, Mentoring, Prioritisation) each with a tiny icon and a one-line action; a small annotation about how each is tested in behavioural rounds; on a light PipeCode card.

The behavioural-round mapping.

  • "Tell me about a time you partnered with a non-technical stakeholder" → stakeholder management.
  • "Walk me through your worst on-call incident" → on-call discipline.
  • "Tell me about a design doc you wrote that was controversial" → communication.
  • "Tell me about a junior you helped grow" → mentoring.
  • "Tell me about a time you said no" → prioritisation.

STAR is the answer shape — every story, every time.

  • Situation — set the stage in one sentence.
  • Task — what you specifically owned.
  • Action — what you did (always "I", not "we").
  • Result — measurable outcome with a number.

Stakeholder management — translate business asks into SQL specs

Detailed explanation. Stakeholder management is the muscle of turning a vague marketing or product question ("why is signup conversion down?") into a clear data spec ("week-over-week conversion by traffic source for paid signups, excluding internal-test users, comparing last 8 weeks") and back to a business answer the stakeholder can act on. The senior signal is expectation management — saying "the v0 lands in 3 days; v1 with the cohort dimension in 2 weeks" instead of nodding silently and missing.

STAR template (stakeholder management).

SITUATION (1 sentence)
The marketing team came to me with a vague question: "why is signup conversion down?"

TASK (1 sentence)
I owned turning that into a data spec, building the metric, and reporting back in <1 week.

ACTION (3-5 short sentences)
- I scheduled a 30-min scoping call and wrote a shared one-pager with 4 candidate definitions.
- We picked "paid signups within 24h of first-touch, week-over-week, by source" together.
- I shipped a v0 SQL view in 3 days that any analyst could query.
- I set up a weekly review meeting and a #marketing-data Slack channel for follow-ups.

RESULT (1 sentence with a number)
The marketing team caught a 9% drop attributable to one paid channel within 2 weeks
and saved an estimated $40k of mis-attributed spend that quarter.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Scope-call + one-pager — turns vague ask into precise spec; killer for unwritten expectations.
  2. Pick the definition together — the stakeholder owns the metric; you own the SQL. Joint ownership prevents future "that's not what I meant."
  3. Ship v0 fast — momentum beats perfection; v0 surfaces the next questions.
  4. Set up the channel — async > meetings; the Slack channel scales with team size.
  5. Result with a number — "saved $40k" is the senior shape; "they were happy" is the junior shape.

Output.

STAR beat Senior shape Junior shape
Situation 1 sentence 5 minutes of context
Task "I owned X" "we kind of did X"
Action 4 sharp first-person actions 8 vague we-statements
Result quantified business impact "they were happy"

On-call discipline — runbooks, alerting hygiene, postmortems

Detailed explanation. On-call discipline is the muscle of running a data platform 24/7 without burnout. The four pillars: runbooks (every page has a documented response in <2 minutes), alerting hygiene (zero false positives; pages are real), postmortems (every incident gets a blameless write-up within 5 business days), paging budget (every team owns its noise; a noisy team is asked to fix it).

STAR template (on-call discipline).

SITUATION
We had a Kafka consumer-lag page that fired at 3am every other week.

TASK
I owned the on-call rotation for the streaming platform; this page was killing morale.

ACTION
- I pulled the last 6 weeks of incidents — 80% of the pages were resolved by "just wait."
- I wrote a runbook for the remaining 20% (real consumer-group rebalance issues).
- I raised the alert threshold from 1k lag to 50k lag and added a 10-min sustain window.
- I added a recovery-action playbook to the on-call dashboard.
- I scheduled a postmortem template-walkthrough for the team.

RESULT
Pages dropped 70% within one month; MTTR on real incidents fell from 25 min to 8 min
because the runbook was already open.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Pull the data first — "80% resolved by just wait" is the killer line. Data > opinion in incident reduction.
  2. Runbook for the real ones — the 20% deserve documentation; the 80% deserve to not page at all.
  3. Threshold + sustain — raises specificity. Most on-call noise is noisy thresholds.
  4. Recovery dashboard — the on-call engineer should reach for the dashboard before reaching for code.
  5. Postmortem cadence — blameless, time-boxed, and templated. Every incident becomes a teaching artifact.

Output.

On-call hygiene primitive Effect
Runbook per alert MTTR drops
Threshold + sustain window false positives drop
Blameless postmortem repeat incidents drop
Paging budget per team team owns its noise

Communication — RFCs, design docs, async-first

Detailed explanation. Senior-DE communication is async-first written communication — RFCs, design docs, weekly status updates, postmortems. The signal: a 1-page design doc that a sceptical reviewer can read in 5 minutes and either approve or comment on with crisp blockers. Verbal communication matters too (1:1s, standups) but written is the leverage point.

STAR template (communication).

SITUATION
Our data platform was considering switching from Airflow to Dagster.

TASK
I owned writing the RFC that would gate the decision.

ACTION
- I wrote a 2-page RFC: problem statement, 3 options, recommendation, cost, rollback.
- I held two 30-min open-comment sessions for the 12 engineers on the platform team.
- I addressed every comment in a public reply thread with my reasoning.
- I scheduled a decision meeting with the directors and walked the doc in 10 minutes.
- I shipped a 1-page summary post-decision so the org could ramp on the choice.

RESULT
Decision made in 2 weeks (vs. the typical 8-week stall pattern); zero post-decision
litigation because every concern was on the public record.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. RFC is the artifact — not a meeting; not a Slack thread; not a deck. The RFC is the source of truth.
  2. Open-comment session — async first, sync to unblock. Anyone who didn't comment forfeits the right to litigate later.
  3. Public reply thread — every concern is acknowledged in writing. Silence on a concern is the worst signal.
  4. Decision meeting walks the doc, not invents new content — saves 90% of decision-meeting time.
  5. Summary post — the doc is rich; the summary is what 50 other engineers will actually read.

Output.

Communication artifact Reader effort When to reach for it
1-page summary 1 min broad org broadcast
2-page RFC 5 min every meaningful technical decision
Design doc (5–10 pages) 15 min quarter-scale projects
Postmortem (template) 10 min every incident
Weekly status 2 min every Friday, every project

Mentoring — onboard juniors, code reviews that teach

Detailed explanation. Mentoring is the muscle of growing other engineers without burning out yourself. The pillars: 30-day onboarding plan (with weekly milestones), code reviews that teach (the comment explains why, not just what), paired debugging (one screen, two engineers, on the real bug), career conversations (quarterly, not "next promo cycle").

STAR template (mentoring).

SITUATION
A new-grad joined my team with strong CS fundamentals but no production DE experience.

TASK
I owned their 30-day onboarding plan and weekly 1:1s.

ACTION
- I wrote a 1-page 30-day plan: Week 1 SQL deep dive, Week 2 dbt model, Week 3 Airflow DAG, Week 4 end-to-end ticket.
- I paired with them for 1 hour on day 1, day 3, day 5, then 1:1s weekly.
- Every code review comment of mine included a "here's the why" link to a doc or a paper.
- I made sure they shipped a real production change in week 2 (small, but real).
- I introduced them to 5 senior engineers on adjacent teams in their first 2 weeks.

RESULT
They were taking on independent tickets by week 5 (target was week 6) and ranked
"strongly exceeds" on their 6-month review.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. 30-day plan — removes ambiguity for the junior and creates accountability for the mentor.
  2. Paired debugging — the highest-bandwidth knowledge transfer. One hour of pairing = 5 hours of solo struggle.
  3. Code-review "why" — every comment links to a doc, a paper, or a prior incident. Teaches reasoning, not rules.
  4. Real production change in week 2 — momentum matters more than safety; small + real beats big + simulated.
  5. Adjacent-team intros — builds the junior's network; reduces their "everything goes through me" burden.

Output.

Mentoring primitive Junior-side effect Mentor-side cost
30-day plan clear weekly milestones 2 hours to write
Paired debugging rapid skill transfer 1 hour per session
Why-linked code review teaches reasoning 5 min per PR vs 2 min
Real prod change in week 2 confidence + ownership one extra reviewer cycle

Prioritisation — saying no, defending scope, OKR alignment

Detailed explanation. Prioritisation is the muscle of saying no to a stakeholder without burning the relationship, and defending scope against a sceptical PM without becoming a roadblock. Senior DEs hold the line because they tie every request to OKRs and surface trade-offs explicitly.

STAR template (prioritisation).

SITUATION
Three product teams asked for "urgent" dashboard work in the same week.

TASK
I owned weekly prioritisation for our 4-engineer platform team.

ACTION
- I wrote a 1-pager mapping each ask to the company's two Q-OKRs (revenue + retention).
- Ask A mapped to revenue (P0); Ask B mapped to retention (P1); Ask C mapped to neither (P3).
- I took the 1-pager to the requesting PMs in a 20-min sync and said: "we can ship A this week, B next sprint, and C in Q+1 unless you can re-map it to an OKR."
- Ask C's PM brought a re-map argument; we accepted the re-map and bumped C to P2.
- I sent a recap email with the final priorities and timelines.

RESULT
Zero re-litigation of priorities for the quarter; team velocity rose 30% because
context-switching went down.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Map every ask to an OKR — the OKR is the shared currency; arguments framed in OKR-language carry weight.
  2. Make the "no" specific — not "no", but "we can ship A; B slips; C goes to Q+1 unless..."
  3. Give the requester a path — "unless you can re-map" is generous; it lets the requester salvage their ask.
  4. Recap in writing — the recap is the contract; reduces silent expectation drift.
  5. Velocity rises because context-switching falls — the senior signal is "I made the team faster by saying no to noise."

Output.

Prioritisation primitive Effect on team
OKR mapping shared currency for arguments
Specific no + path preserves the relationship
Written recap reduces silent drift
Said-no number counter-intuitive promotion signal

DE-skills interview question on a controversial trade-off

A senior behavioural probe: "Tell me about a time you had to defend a technical decision in front of a sceptical audience." — testing communication + prioritisation + influence at once.

Solution Using the STAR + trade-off-and-counter shape

SITUATION (15s)
Our streaming pipeline was missing the 99.9% freshness SLA twice a month.
The product team wanted us to migrate from Spark Structured Streaming to Flink.

TASK (10s)
I owned the recommendation; the platform director wanted my call in 2 weeks.

ACTION (90s)
- I instrumented the pipeline and found 80% of misses came from one upstream connector,
  not the stream engine itself.
- I wrote an RFC with three options: (a) migrate to Flink, (b) fix the connector,
  (c) raise the SLA from 99.9% to 99.5% and revisit in 6 months.
- I held an open-comment session; the product team initially pushed for (a).
- I walked the numbers: (a) cost = 12 engineer-weeks + new on-call burden, (b) cost
  = 2 weeks + zero new operational surface, (c) cost = zero engineering but stakeholder
  trust hit.
- I recommended (b) and committed to a 2-week timeline + a rollback to (a) if misses
  didn't drop below 1/month.
- Product team agreed because the trade-off was on paper, not in our heads.

RESULT (15s)
We shipped the connector fix in 9 days; misses dropped to 0 in the following 8 weeks;
saved an estimated 10 engineer-weeks of unnecessary platform migration.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

STAR beat Time Senior cue
Situation 15s one sentence, no padding
Task 10s "I owned" — first-person ownership
Action 90s 4 sharp action verbs + named trade-off
Result 15s quantified savings + a number

Output:

Story dimension Senior shape Junior shape
Length 2 minutes total 6 minutes total
Voice "I" "we"
Data numbers in the action numbers absent
Outcome engineer-weeks saved "they were happy"
Decision artifact RFC + open comment meeting + Slack

Why this works — concept by concept:

  • Data-first action — instrumenting before recommending is the senior move. It earns the right to push back on the product team's preferred answer.
  • RFC with 3 options — never bring a recommendation without alternatives; the alternatives are how trust is earned.
  • Quantified trade-off — "12 engineer-weeks + new on-call burden" beats "this would be a lot of work." Numbers unlock decisions.
  • Rollback commitment — "rollback to (a) if misses don't drop below 1/month" is the reversibility clause that lets the team commit.
  • Result with a number — "saved 10 engineer-weeks" is the bar-raiser-ready close.
  • Cost — preparing one STAR story takes 30–45 minutes; the muscle of thinking in STAR comes from doing 10–15 of them.

Behavioural
Course — behavioural prep
Behavioural interview prep for DEs

Practice →

Product
Course — product sense + metrics
Product sense and metrics for DE interviews

Practice →


Cheat sheet — DE skills self-assessment grid

The grid below is a 15-row self-rating sheet — 10 technical + 5 soft. Rate yourself on each row against the three columns (Junior / Mid / Senior expectation), then circle your three weakest skills as study priorities for the next quarter.

Skill Junior expectation Mid expectation Senior expectation
SQL joins, GROUP BY window fns, CTEs, EXPLAIN recursive CTEs, plan tuning, dialect quirks
Python functions, dict, list pandas, requests, pytest asyncio, SQLAlchemy, performance profiling
Dimensional modelling fact + dim concept grain, SCD types 1/2 conformed dims, bus matrix, SCD type 6
Apache Spark DataFrames basics shuffle, broadcast, AQE partition pruning, perf tuning at scale, custom UDFs
Apache Kafka topics + partitions consumer groups, acks EOS, transactions, KRaft, zombie fencing
Airflow write a DAG sensors, retries, backfills deferrable operators, custom executors, secret backends
Cloud console clickops IAM least privilege, S3/GCS, FaaS Terraform, multi-region, cost ownership
Warehouses write a SELECT clustering, partitioning, cost warehouse-level cost optimisation, governance
dbt model + ref tests, sources, incremental macros, exposures, state:modified+ in CI
System design one-box pipeline trade-off naming failure-mode + cost in the answer
Stakeholder management take a ticket translate ask → spec manage expectations + drive priorities
On-call discipline follow a runbook tune alerts, write runbooks own the rotation, drive paging budget
Communication Slack thread 1-page summary 2-page RFC + open-comment sessions
Mentoring help when asked onboard a junior in 30 days grow a team; codify the curriculum
Prioritisation finish the sprint scope the quarter say no with OKR mapping; defend scope

How to use the grid.

  • Self-rate each row Junior / Mid / Senior against where you actually operate today.
  • Star three rows where you're lower than the level you're interviewing for.
  • For each starred row, pick one practice loop from PipeCode and run it weekly.
  • Re-rate every 8 weeks; you should see at least 2 rows shift by one level per quarter.

Frequently asked questions

How many programming languages should a data engineer know?

One language cold — usually Python — plus SQL as a first-class language, plus enough shell/bash to be dangerous on a Linux box. A second language (Scala or Java) is a strong "nice to have" for JVM-stack shops (Spark on Scala, Kafka Streams in Java) but not a gate for most 2026 DE roles. Don't chase a third language unless your target team uses it daily; depth beats breadth in the loop.

Is Java required for data engineering?

Not for most DE roles in 2026. Spark on Python (PySpark) covers the majority of distributed-compute work; Kafka clients in Python (confluent-kafka) cover most messaging work. Java becomes important if you want to write Kafka Streams applications, contribute to Apache projects, or work at a Spark-on-Scala shop like Netflix or Databricks. If you have to pick one, pick Python and reach for Scala or Java when the team you're joining uses them.

How important is system design for DE interviews?

Very — system design is the seam round that tests every other skill at once. Most senior+ loops include one 45–60 minute system-design round; staff+ loops include two. The probe rewards trade-off reasoning over right answers — clarify the prompt, sketch the pipeline, name two alternatives, draw the failure modes, and put a back-of-envelope cost on it. A weak system-design round will tank an otherwise strong loop.

Do I need to know machine learning for DE?

Not deeply. The bar is "you can ingest features, store them with low-latency lookup, schedule retraining, and not lose track of model lineage" — i.e. you understand the data side of ML platforms. You do not need to derive gradient descent or tune transformer hyper-parameters. If your target team is ML-infrastructure-heavy (Uber's Michelangelo, Spotify's Hendrix), spend 2–3 weeks on feature stores, model registries, and online inference patterns — not on the math.

Are soft skills tested in DE interviews?

Yes, in every senior+ loop. The behavioural round is typically 45–60 minutes of STAR questions across the five soft skills — stakeholder management, on-call discipline, communication, mentoring, and prioritisation. Bar-raisers and hiring managers weigh this round at least as heavily as the technical rounds. Prepare 8–12 STAR stories, each ~2 minutes long, that you can map to any of the five themes; the muscle is in the mapping, not the memorising.

Which skill should I learn next if I already know SQL + Python?

In order: (1) one warehouse cold (Snowflake or BigQuery — clustering, partitioning, cost) → (2) dbt — models, refs, sources, tests, incremental → (3) Airflow — DAGs, sensors, deferrable operators → (4) Spark — DataFrames, shuffle, broadcast joins. After that pick one of Kafka (if your target team is streaming-heavy) or system design (if your target loop has a design round). Resist the urge to start three at once; finish one before opening the next.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every technical skill in the 10-skill grid above ships with a hands-on practice room where you write the SQL, fix the Spark plan, or trace the Kafka rebalance on real data. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, plus the behavioural course bundle for the soft 5.

Practice SQL window functions now →
DE system-design drills →

Top comments (0)