big data engineering in 2026 is a system-design discipline, not a tooling choice. The keyword used to evoke racks of HDFS commodity hardware, a Java MapReduce job, and a Cloudera support contract. Today it evokes a five-layer cloud stack — object storage at the base, an open table format for ACID, a compute engine (Spark or Flink) that is interchangeable on top, a streaming backbone (Kafka) that turns the warehouse into a real-time substrate, and a governance plane that finally treats data assets as first-class citizens. The job title is the same; the stack is unrecognisable.
This roadmap is the answer to the question every early-career engineer asks when they first see "Hadoop, Spark, Kafka" on a job posting: what do I actually need to know in 2026, what is safe to skip, and in what order do I learn it? It walks the 3 V evolution from 2010 to 2026, maps the modern stack layer by layer, settles the Lambda vs Kappa architecture debate, and ends with a month-by-month 6-month learning ladder built around pyspark, kafka, iceberg, and the lakehouse pattern. Each section pairs a teaching block with a worked interview answer — code, step-by-step trace, output table, and a concept-by-concept why-it-works.
When you want hands-on reps alongside this reading, drill the ETL practice library → for pipeline design, rehearse on streaming problems → for the Kafka / Flink surface, and stack the engine-tuning muscles with the Apache Spark internals course →.
On this page
- What "big data engineering" actually means in 2026
- The 3 V evolution — from Hadoop 2010 to Lakehouse 2026
- The 2026 stack map — Hadoop ecosystem, Spark, Kafka, Lakehouse
- Batch vs streaming — Lambda is dead, Kappa rules
- The 6-month learning ladder for 2026
- Cheat sheet — big data recipes
- Frequently asked questions
- Practice on PipeCode
1. What "big data engineering" actually means in 2026
Big data engineering is the discipline of designing distributed, fault-tolerant pipelines that move and transform data at a scale where a single machine no longer fits — and at a velocity where minutes of latency are no longer acceptable
The one-sentence invariant: "big data" stopped being a size threshold in 2018 and became a system-design threshold — the moment your workload requires distributed compute, fault-tolerant storage, or horizontal scaling, you are doing big data engineering, regardless of whether the table is 10 GB or 10 PB. The 2010 definition ("the 3 Vs — volume, velocity, variety") was a marketing framework; the 2026 definition is structural.
Big data engineering vs traditional data engineering.
- Volume axis. Traditional DE happily ships a Postgres + Airflow + dbt stack up to ~1 TB of warm data per table. Big data DE starts when a single table no longer fits on one machine — at which point you need partitioning, distributed file storage, and a compute engine that can read partitions in parallel.
- Velocity axis. Traditional DE assumes batch — the freshest data is yesterday's. Big data DE includes streaming — the freshest data is seconds old, ingested through Kafka or Kinesis and processed by Flink or Spark Structured Streaming.
- Variety axis. Traditional DE assumes relational tables. Big data DE handles JSON, Parquet, Avro, ORC, images, video, embeddings, vector indices, and CDC change logs — sometimes in the same pipeline.
- Veracity axis. Traditional DE trusts the source. Big data DE assumes the source lies, drops rows, replays events, and emits duplicates — so every pipeline must be idempotent and every sink must dedupe.
Why "big" is no longer a size threshold.
The historical threshold was "petabytes." It is now closer to "the moment your stack needs more than one machine, or the moment your SLA is shorter than the batch interval of your warehouse." A modern fintech with 80 GB / day of trades and a sub-second SLA is doing big data engineering. A scientific archive with 50 TB / year stored as plain Parquet on S3, queried twice a month with Trino, is not — even though the dataset is technically larger.
The 3 archetypes a modern big data engineer ships.
- Warehouse. The classic batch ETL: nightly Spark jobs read from S3, transform in PySpark or dbt, write to Snowflake or BigQuery. BI consumes it the next morning.
- Lakehouse. Same source data, but written to Iceberg or Delta tables on S3 with ACID semantics. Spark, Trino, Flink, and Snowflake all read the same files; no copies.
- Streaming pipeline. Kafka topic → Flink or Spark Structured Streaming → Iceberg (or a serving layer like Pinot / Druid). End-to-end latency in seconds; backfill by replaying the Kafka topic from offset 0.
Why Hadoop is still in the curriculum.
Almost nobody runs MapReduce in 2026. But two of Hadoop's children — HDFS-style replication and the YARN resource model — quietly underpin most modern engines. Spark inherited the partition / shuffle / executor mental model from Hadoop. Hive's table abstraction became Iceberg's spec. Even Snowflake's micro-partitions are HDFS blocks rebranded. Skipping Hadoop history means missing the why behind every optimisation choice in Spark, Iceberg, and Delta.
Who needs this stack — and who genuinely does not.
- Need it: FAANG, fintech, ad-tech, IoT, telco, ride-share, streaming media, large e-commerce. Anywhere the data per day exceeds a hundred GB or the latency SLA drops below an hour.
- Do not need it (yet): early-stage SaaS, internal tools, most analytics startups under Series B. Postgres + dbt + Airflow scales further than the internet thinks. The mistake is reaching for Spark when a well-indexed Postgres would have shipped in a week.
The petabyte litmus test.
The honest threshold for graduating from "Postgres + Airflow" to "Spark + Iceberg" is not a TB number — it is three operational signals firing at once:
- A single query routinely takes longer than your batch interval.
- A single table no longer fits in your warehouse's largest cluster comfortably.
- You need to back-process years of historical data on demand (e.g. for a feature replay or a regulatory audit).
When two of those three are true, the rewrite to the lakehouse pays for itself in operational savings within a quarter. Before that, the rewrite is premature optimisation.
Worked example — sizing the stack against a real workload
Detailed explanation. A team running 50 GB / day of clickstream on a Postgres + Airflow stack is asked to add a "last 5 minutes of activity" dashboard. The naive plan is "schedule Airflow every 5 minutes." The senior plan is "this is the signal that the team has crossed into streaming territory; rewrite the ingestion path through Kafka and a streaming engine."
Question. Given a clickstream workload of 50 GB / day with a new 5-minute latency SLA, justify whether the team should stay on Postgres + Airflow, add Kafka + Spark Structured Streaming, or rewrite to a lakehouse. List the three deciding signals.
Input — workload snapshot.
| Dimension | Today | New requirement |
|---|---|---|
| Volume / day | 50 GB | 50 GB |
| Latency SLA | 24 h (overnight batch) | 5 min |
| Backfill needs | 30 days | 2 years (regulatory) |
| Concurrent readers | 5 analysts | 50+ (live dashboard) |
| Stack | Postgres + Airflow + dbt | TBD |
Code (pseudocode decision matrix).
def stack_choice(volume_gb_day, sla_minutes, backfill_days, readers):
streaming = sla_minutes <= 30
lakehouse = volume_gb_day * backfill_days > 10_000 # > 10 TB of warm history
high_concurrency = readers > 25
if streaming and lakehouse:
return "Kafka + Flink/Spark Streaming + Iceberg + Trino"
if streaming and not lakehouse:
return "Kafka + Spark Structured Streaming + Postgres sink"
if lakehouse:
return "S3 + Iceberg + Spark batch + Trino"
return "Stay on Postgres + Airflow + dbt"
print(stack_choice(50, 5, 730, 50))
Step-by-step explanation.
- The 5-minute SLA forces the streaming branch — no batch interval shorter than 5 minutes will reliably hit it without a Kafka backbone and an incremental engine.
- The 2-year backfill at 50 GB / day = 36 TB of warm history, well above the 10 TB threshold for staying on Postgres. The team needs a lakehouse table.
- 50 concurrent readers means you cannot serve them all from a single Postgres replica; you need a query engine designed for concurrency (Trino, Presto, Snowflake) reading the same lakehouse files.
- The combination of the three signals (streaming SLA, large backfill, high concurrency) returns the full lakehouse + Kafka stack — the rewrite is now justified by operational math, not by hype.
Output.
| Decision | Result |
|---|---|
| Streaming required? | Yes (SLA = 5 min) |
| Lakehouse required? | Yes (warm history = 36 TB) |
| Recommended stack | Kafka + Flink/Spark Streaming + Iceberg + Trino |
Rule of thumb. Do not rewrite to big data tooling because of one signal. Wait until two of the three (streaming SLA, large backfill, high concurrency) fire before paying the migration cost. Until then, vertical-scaling Postgres + better dbt models will outperform a half-built Spark cluster.
Worked example — the "is this big data?" interview probe
Detailed explanation. A senior interviewer hands a candidate a dataset description and asks "is this big data engineering or traditional?" The expected answer is not a number — it is a structural argument naming the system-design pressure that pushes the workload off a single machine.
Question. A logistics company ingests 5 GB / day of GPS pings from 200,000 vehicles. The current stack is Postgres + Airflow nightly. The product team now wants a live map showing each vehicle's latest position with under 10 seconds of latency. Is this big data engineering? Justify the answer with the three structural signals.
Input.
| Dimension | Value |
|---|---|
| Volume per day | 5 GB |
| Source rate | ~2,000 pings / second peak |
| Latency SLA | < 10 seconds |
| Concurrent map viewers | 500 |
| Backfill | "live only" |
Code (signal check).
volume_signal = 5 < 100 # 5 GB/day is small
velocity_signal = 10 < 60 # 10 sec SLA is streaming
variety_signal = True # JSON + geospatial points
veracity_signal = True # ping drops, duplicates expected
concurrency_signal = 500 > 50
is_big_data = velocity_signal or (volume_signal and concurrency_signal)
print(is_big_data) # True
Step-by-step explanation.
- The volume signal is false — 5 GB / day is well under the 100 GB / day rough threshold for "needs a distributed file store."
- The velocity signal is true — a 10-second SLA cannot be met with batch ETL, no matter how short the interval. It needs a streaming engine.
- The concurrency signal is true — 500 concurrent dashboard viewers cannot all hit a single Postgres replica without degrading write throughput.
- The structural answer is: this is big data engineering on the velocity axis, even though the volume is small. The team needs Kafka for the ingestion buffer, a streaming engine for the transform, and a serving store designed for high-concurrency reads (Redis, Pinot, or a materialised view in Postgres).
Output.
| Question | Senior answer |
|---|---|
| Is this big data engineering? | Yes — driven by velocity, not volume |
| What stack? | Kafka → Flink → Pinot / Redis serving layer |
| What is not needed? | Spark batch, lakehouse table format (yet) |
Rule of thumb. Never answer "is this big data?" with a TB number. Answer with the structural signal that fires — volume, velocity, variety, or concurrency — and then name the engine that matches that signal. Interviewers respect the framework more than the magnitude.
Big data interview question on workload classification
A senior interviewer often opens with: "Walk me through how you decide whether a new workload needs a big data stack — what numbers do you look at, what signals do you weigh, and how do you defend the choice to a PM who just wants 'the cheap option'?" The probe blends sizing intuition, streaming awareness, and stakeholder communication.
Solution Using the four-signal workload classifier
def classify_workload(volume_gb_day, sla_minutes,
backfill_days, concurrent_readers,
variety_types):
"""
Returns one of:
- "single-node" : Postgres / DuckDB / single SQL warehouse
- "warehouse" : Snowflake / BigQuery + dbt + Airflow
- "lakehouse" : S3 + Iceberg / Delta + Spark + Trino
- "streaming" : Kafka + Flink / Spark Structured Streaming
"""
streaming = sla_minutes <= 30
warm_tb = volume_gb_day * backfill_days / 1024
distributed_storage = warm_tb > 10
high_concurrency = concurrent_readers > 25
variety = len(variety_types) >= 3 # tabular + json + binary
if streaming:
return "streaming"
if distributed_storage and (high_concurrency or variety):
return "lakehouse"
if distributed_storage:
return "warehouse"
return "single-node"
print(classify_workload(
volume_gb_day=50,
sla_minutes=5,
backfill_days=730,
concurrent_readers=50,
variety_types=["json", "parquet", "avro"],
))
Step-by-step trace.
| Signal | Computed | Threshold | Fires? |
|---|---|---|---|
| Streaming SLA | 5 min | ≤ 30 min | yes |
| Warm history | 50 × 730 / 1024 = 35.6 TB | > 10 TB | yes |
| Concurrency | 50 readers | > 25 | yes |
| Variety | 3 formats | ≥ 3 | yes |
All four signals fire — but the function returns "streaming" because the streaming branch is checked first (the SLA dominates the stack choice). The lakehouse + concurrency signals tell the architect that the streaming pipeline must write to an Iceberg table downstream, not to a single Postgres sink.
Output:
| Input workload | Classifier returns |
|---|---|
| 50 GB / day, 5 min SLA, 730d backfill, 50 readers, 3 formats | streaming (with lakehouse sink) |
Why this works — concept by concept:
- SLA dominates stack choice — a sub-30-minute latency requirement forces a streaming engine on top of any batch design. No amount of cron tightening turns Airflow into a streaming system.
- Warm-history TB drives storage choice — once you must keep more than ~10 TB on warm storage, object storage + a table format is dramatically cheaper than warehouse-native storage.
- Concurrency drives serving layer choice — 25+ concurrent readers either need a managed warehouse (Snowflake / BigQuery) or a query engine pool (Trino / Presto) reading from the lakehouse.
- Variety drives format choice — three or more formats (JSON + Parquet + Avro + images) effectively force a lakehouse: the warehouse-native column-store cannot absorb binary types cheaply.
- Cost — the classifier is O(1) per workload; the real cost is the engineering review that happens when two of the four signals fire and the team must decide whether to rewrite or scale vertically.
DE
Topic — ETL
ETL pipeline problems
2. The 3 V evolution — from Hadoop 2010 to Lakehouse 2026
Sixteen years of big data architecture compress into two structural shifts: HDFS gave way to object storage + table format, and batch gave way to exactly-once streaming
The mental model in one line: the 3 Vs (volume, velocity, variety) grew on every axis between 2010 and 2026, but the interesting growth was on variety — which is what pushed the industry off HDFS and onto open table formats on cloud object storage. Once you can name the two paradigm shifts cold, every "why do you need a lakehouse?" interview question collapses to a sentence.
2010 — the Hadoop baseline.
- Volume: gigabytes / day, sometimes terabytes for the largest web companies. One Cloudera CDH cluster on a rack of commodity hardware was the upper bound for most teams.
- Velocity: strictly batch. The "nightly job" was the design idiom. Latency was 24 hours and that was considered fine.
- Variety: CSV, server logs, the occasional Avro file. Almost everything was tabular or close to it.
- Stack: HDFS for storage, MapReduce for compute, Hive for SQL on top, Pig for procedural transforms, Oozie for scheduling. ZooKeeper somewhere in the corner coordinating leadership.
2015 — the Spark inflection.
- Volume: terabytes / day became routine. Single tables started crossing the 100 GB mark in production.
- Velocity: micro-batch. Spark's 5-minute intervals were the "new fast." Kafka entered the picture as an ingestion buffer.
- Variety: Parquet replaced text logs as the default columnar format. Avro for schema-evolving event streams. JSON in everything.
- Stack: Spark replaced MapReduce for compute (10-100× faster on the same hardware). HDFS still dominated storage but S3 started showing up. Airflow replaced Oozie. Cloudera and Hortonworks merged.
2020 — the streaming era.
- Volume: terabytes / hour for the busiest sources (ad tech, real-time bidding, large e-commerce). The "batch interval" started feeling like the wrong abstraction.
- Velocity: true streaming. Spark Structured Streaming and Flink delivered second-level latencies with exactly-once semantics. Kafka became the central nervous system.
- Variety: geospatial, video, CDC change logs, IoT telemetry. Multi-format pipelines became the norm.
- Stack: S3 for storage (HDFS in maintenance mode), Spark + Flink for compute, Kafka for ingestion, dbt for SQL-side transforms, Snowflake / BigQuery / Databricks for warehouses, Airflow + Dagster for orchestration.
2026 — the lakehouse era.
- Volume: petabyte-scale tables are routine. The largest single Iceberg tables are pushing into the exabyte range at hyperscalers.
- Velocity: sub-second exactly-once is the target for hot paths. Latency tiers are explicitly designed (< 1s for Flink, seconds for Spark Streaming, minutes for micro-batch, hours for batch).
- Variety: vectors, embeddings, multimodal data, real-time ML features. The variety axis is what drives stack choice — more than raw volume.
- Stack: S3 / GCS / Azure Blob for storage; Iceberg / Delta / Hudi as the table format providing ACID, time travel, schema evolution, and hidden partitioning; Spark + Flink + Trino + Snowflake / BigQuery / Databricks all reading the same files; Unity Catalog / Polaris / Nessie for governance.
The death of HDFS.
The biggest single shift between 2010 and 2026 is the storage layer. HDFS — for fifteen years the source of truth — is now in maintenance mode at every major user. The reasons are economic and operational:
- Cost. S3 storage is ~1/10 the cost of an equivalent HDFS cluster (no replication on warm tier, separation of storage from compute, lifecycle policies).
- Operational burden. HDFS NameNode failover, DataNode rebalance, block report storms — none of these problems exist on managed object storage.
- Compute separation. S3 lets you spin compute up and down independently of storage. HDFS forced you to keep the cluster running 24/7 to keep the data alive.
The table format (Iceberg, Delta, Hudi) replaced what HDFS used to provide on top of the file system: schema, partition layout, transactional commits, time travel. The file system stopped being the source of truth; the table format catalog became the source of truth.
Why variety drives stack choice in 2026.
The 2010 stack was tuned for volume: how do we fit more bytes per dollar. The 2015 stack was tuned for velocity: how do we get from batch to micro-batch. The 2026 stack is tuned for variety: how do we hold tabular, semi-structured, binary, vector, and streaming data in the same governed substrate. That is the whole reason for the lakehouse — one table format that accepts all of them, queried by every engine.
Worked example — naming the right stack for a given era
Detailed explanation. A common interview probe: "Here is a workload description from 2014. What stack would you have shipped, and what would you ship now in 2026 instead?" The answer surfaces whether the candidate knows the history and can defend a modern rewrite.
Question. A retail company in 2014 ran 200 GB / day of POS transactions on a Cloudera HDP cluster with MapReduce + Hive + Oozie. In 2026, the same workload is 2 TB / day with a 15-minute freshness SLA. Map the 2014 stack to its 2026 equivalent component by component.
Input — 2014 vs 2026 mapping.
| Layer | 2014 (HDP) | 2026 (lakehouse) |
|---|---|---|
| Storage | HDFS | S3 + Iceberg |
| Compute | MapReduce | Spark (PySpark + Spark SQL) |
| Ingestion | Sqoop + Flume | Kafka + Debezium (CDC) |
| SQL surface | Hive | Trino / Spark SQL / Snowflake |
| Orchestration | Oozie | Airflow / Dagster |
| Governance | Sentry / Ranger | Unity Catalog / Polaris |
| Coordination | ZooKeeper | (managed in the platform) |
Code (PySpark equivalent of a 2014 MapReduce + Hive ETL).
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder \
.appName("pos-daily-rollup") \
.config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog") \
.getOrCreate()
# Read raw POS events from Iceberg (replaces the 2014 Hive external table)
raw = spark.read.table("lake.raw.pos_events")
# Daily rollup per store and category
daily = (raw
.filter(F.col("event_ts") >= F.current_date() - F.expr("INTERVAL 1 DAY"))
.groupBy(F.to_date("event_ts").alias("event_date"), "store_id", "category")
.agg(
F.sum("amount").alias("revenue"),
F.countDistinct("basket_id").alias("baskets"),
F.count("*").alias("line_items"),
)
)
# Write to the lakehouse silver table (replaces the 2014 Hive INSERT OVERWRITE)
(daily.writeTo("lake.silver.pos_daily")
.using("iceberg")
.partitionedBy("event_date")
.createOrReplace())
Step-by-step explanation.
- The 2014 job lived in
MapReduce+Hive+Oozie. The same logic in 2026 is a single PySpark script — same algorithm, 50× fewer lines, 100× faster per dollar. - Storage replaced HDFS with S3 + Iceberg: same
pos_eventstable abstraction, but the file system is now durable cloud storage and the catalog handles ACID commits. - The
writeTo(...).using("iceberg")clause atomically commits a new snapshot; if the job crashes mid-write, the previous snapshot remains the source of truth. MapReduce + Hive offered nothing equivalent. - The
partitionedBy("event_date")clause uses Iceberg's hidden partitioning — readers do not need to writeWHERE event_date = ...to prune partitions.
Output (rollup table example).
| event_date | store_id | category | revenue | baskets | line_items |
|---|---|---|---|---|---|
| 2026-06-05 | 101 | grocery | 12_400 | 480 | 1_812 |
| 2026-06-05 | 101 | apparel | 4_900 | 110 | 213 |
| 2026-06-05 | 102 | grocery | 9_100 | 320 | 1_205 |
Rule of thumb. When mapping a legacy stack to a 2026 equivalent, walk it layer by layer (storage → compute → ingestion → SQL → orchestration → governance). The mapping is rarely 1:1 — sometimes two 2014 components collapse into one 2026 component (Sqoop + Flume → Kafka + Debezium), and sometimes the catalog absorbs what HDFS used to do.
Worked example — the file-system-to-table-format pivot
Detailed explanation. The single most consequential shift between 2010 and 2026 is moving the source of truth from the file system (HDFS) to the table format (Iceberg / Delta). Interviewers probe this because the operational, governance, and cost implications cascade everywhere.
Question. Explain in three concrete operational examples why the lakehouse table format is not "just S3 with extra steps" but a fundamental architectural pivot from HDFS-style file management.
Input — three operational scenarios.
| Scenario | HDFS / raw S3 | Iceberg / Delta lakehouse |
|---|---|---|
| Concurrent writers | last-writer-wins, possible partial files | ACID commit via metadata catalog |
| Schema change (new column) | rewrite every file or maintain two schemas |
ALTER TABLE ADD COLUMN — metadata only |
| Audit "what was the table at noon yesterday?" | restore from backup | SELECT ... TIMESTAMP AS OF '2026-06-05 12:00:00' |
Code (Iceberg time travel).
-- Read the table as it existed at a point in time
SELECT store_id, SUM(revenue) AS revenue_at_noon
FROM lake.silver.pos_daily TIMESTAMP AS OF '2026-06-05 12:00:00'
WHERE event_date = DATE '2026-06-05'
GROUP BY store_id;
-- Roll back the table to a previous snapshot
CALL lake.system.rollback_to_snapshot('silver.pos_daily', 7842365912834);
Step-by-step explanation.
- The
TIMESTAMP AS OFclause asks Iceberg to return the snapshot whose commit timestamp is closest to (but not after) noon yesterday. HDFS has no equivalent — you would have restored from a backup, an expensive and error-prone operation. - The
rollback_to_snapshotprocedure points the catalog at an older snapshot ID. Subsequent reads see the older state; the rollback is metadata-only and takes milliseconds. - Concurrent writers on HDFS were a permanent operational headache (partial files, NameNode contention). Iceberg's ACID commits via the catalog make concurrent writes safe by construction.
- Schema evolution on HDFS forced a file rewrite (or schema-on-read with downstream breakage). Iceberg's metadata layer absorbs the change — readers see the new schema, old snapshots stay valid.
Output.
| Capability | HDFS | Iceberg |
|---|---|---|
| Atomic commits | no | yes |
| Time travel | no (restore from backup) | yes (millisecond rollback) |
| Schema evolution | rewrite files | metadata-only |
| Hidden partitioning | manual WHERE clauses |
automatic prune |
| Multi-engine reads | one engine at a time | Spark + Flink + Trino + Snowflake |
Rule of thumb. When defending the lakehouse to a skeptical stakeholder, do not lead with "PB-scale" — lead with time travel, atomic commits, and multi-engine reads. Those three capabilities pay back the migration cost faster than any volume argument.
Big data interview question on the lakehouse pivot
A senior interviewer often probes: "Why did Iceberg and Delta become the default in 2024 even though S3 + Parquet had been around for a decade? What did the table format actually unlock?" The expected answer ties the three pivots — ACID, time travel, schema evolution — to concrete operational wins.
Solution Using the three-pivot lakehouse argument
def lakehouse_value(use_case):
"""Returns the lakehouse capability that pays back the migration."""
matrix = {
"regulatory_audit": ("time_travel", "Replay any historical table state by timestamp."),
"concurrent_writers": ("acid_commits", "Multiple jobs write the same table safely."),
"schema_evolution": ("metadata_only", "Add / drop columns without rewriting files."),
"multi_engine": ("open_format", "Spark + Flink + Trino read the same files."),
"small_files": ("compaction", "Iceberg manages small-file compaction."),
}
return matrix.get(use_case, ("evaluate", "Not a lakehouse-specific win."))
for case in ["regulatory_audit", "concurrent_writers", "schema_evolution", "multi_engine"]:
cap, why = lakehouse_value(case)
print(f"{case:24s} → {cap:18s} | {why}")
Step-by-step trace.
| Use case | Capability invoked | Why this wins |
|---|---|---|
| regulatory_audit | time_travel | Replay any historical state in seconds |
| concurrent_writers | acid_commits | No partial files, no race conditions |
| schema_evolution | metadata_only | New column ships in milliseconds |
| multi_engine | open_format | Spark + Flink + Trino + Snowflake share files |
Each row in the trace pairs a real operational pain point ("we got a GDPR audit request and need to show what the table looked like on 2025-12-31") with the lakehouse capability that turns it into a one-line query. The collective payoff is why the table format pivot was inevitable — even at S3 + Parquet scale.
Output:
| use_case | capability | why |
|---|---|---|
| regulatory_audit | time_travel | "Replay any historical table state by timestamp." |
| concurrent_writers | acid_commits | "Multiple jobs write the same table safely." |
| schema_evolution | metadata_only | "Add / drop columns without rewriting files." |
| multi_engine | open_format | "Spark + Flink + Trino read the same files." |
Why this works — concept by concept:
- ACID on object storage — Iceberg / Delta commit a new snapshot atomically through the catalog; partial writes become invisible until the commit succeeds. This is the single most consequential lakehouse capability.
- Time travel via snapshots — every commit produces a snapshot ID. Reading "as of timestamp T" finds the snapshot whose commit ≤ T. Cheap insurance against accidental writes, regulator requests, and replay-driven feature engineering.
- Schema evolution as metadata — adding a column updates the table metadata but leaves the underlying files alone; readers project the new column as NULL for older rows. Rewrites are deferred and optional.
- Open format, many engines — the table is just Parquet files + a JSON / Avro metadata tree. Any engine that speaks the Iceberg or Delta spec can read it — no vendor lock-in.
- Cost — the metadata layer adds a small per-commit overhead (milliseconds) and a small per-read overhead (one metadata fetch per query). Storage is plain Parquet on S3 — the same files the raw-S3 architecture already uses.
DE
Topic — aggregation
Aggregation problems for big data pipelines
3. The 2026 stack map — Hadoop ecosystem, Spark, Kafka, Lakehouse
Every modern big data job is some permutation of read from storage → transform on compute → write back to storage — and the 2026 stack has exactly five layers that make this true
The mental model in one line: the 2026 big data stack has five layers (storage, compute, ingestion, orchestration, governance) and every component slots into exactly one layer; you can swap any component for a same-layer peer without rewriting the others. Once you can draw the five-layer diagram from memory, every architecture-design interview becomes a "pick one component per layer and defend it" exercise.
Layer 1 — Storage (cloud object + table format).
- Object stores: S3 (AWS), GCS (Google), Azure Blob, Cloudflare R2, MinIO (on-prem). These give you durability, lifecycle policies, and pay-as-you-go pricing.
- Table formats: Iceberg, Delta Lake, Hudi. All three give you ACID, time travel, schema evolution, hidden partitioning. Pick one per organisation; do not mix.
- File formats: Parquet (default), ORC (Hive legacy), Avro (schema-evolving streams). 95% of new tables are Parquet.
- Why this layer matters: every other layer is interchangeable, but the storage + table format choice cascades through every read and write for years.
Layer 2 — Compute engines.
- Spark (PySpark + Spark SQL): the workhorse. Batch ETL, large joins, ML feature engineering, large window functions. Default starting point for any new batch job.
- Flink: the streaming specialist. Sub-second latency, exactly-once semantics, complex event processing. Default for any new streaming job under a 5-second SLA.
- Trino / Presto: the interactive query engine. BI dashboards, ad-hoc SQL across lakehouse tables, high concurrency. Default for the analyst-facing query layer.
- Managed warehouses: Snowflake, BigQuery, Databricks SQL. Same SQL surface, fully managed, generally more expensive per query but cheaper to operate.
- Why this layer matters: the compute engine choice dominates per-query cost and developer productivity. Most teams ship a hybrid: Spark for ETL, Trino for ad-hoc, a managed warehouse for BI.
Layer 3 — Ingestion (batch + streaming).
- Kafka: the default streaming backbone. Topics, partitions, consumer groups, exactly-once semantics. Almost every modern data platform has Kafka in the middle.
- Kinesis / Pulsar: AWS-native and open-source alternatives to Kafka. Same mental model, different operational footprint.
- CDC tools: Debezium (open source), Fivetran (managed), Estuary, Airbyte. They turn transactional databases (Postgres, MySQL) into streaming sources by reading the write-ahead log.
- Bulk loaders: Sqoop (legacy), Airflow operators, dbt sources for managed loads.
- Why this layer matters: ingestion is where pipelines silently lose data. Idempotency, replay, and back-pressure are all decided here.
Layer 4 — Orchestration and transformation.
- Airflow: the de facto orchestrator. Python DAGs, mature ecosystem, easy hire-for skill.
- Dagster: the modern challenger. Asset-centric, native lineage, better testing story.
- Prefect: the lightweight alternative. Python-native, easier local dev.
- Argo Workflows: the Kubernetes-native option. Container-per-task, infrastructure-as-code by default.
- dbt: the SQL transformation layer. Models, tests, docs, lineage. Not an orchestrator — it runs inside the orchestrator.
- Why this layer matters: orchestration is where reliability lives. The choice dictates how you handle retries, alerting, and lineage.
Layer 5 — Governance and catalog.
- Unity Catalog (Databricks): the most mature lakehouse catalog. Handles permissions, lineage, audit, multi-region.
- Polaris (Snowflake): Snowflake's open-source Iceberg catalog, designed for cross-engine federation.
- Nessie: Git-style branching for Iceberg tables. Branch-per-experiment workflows for data.
- OpenMetadata / Atlan / DataHub: the metadata + lineage + discovery plane. Lives on top of the catalog.
- Why this layer matters: governance is where regulators, security, and data product teams meet. Skipping the catalog at the start guarantees you rewrite it in two years.
Where Hadoop / Hive / Oozie still appear.
- CDH / HDP migrations. Every team that ran Cloudera Distribution for Hadoop or Hortonworks Data Platform is still in the middle of a multi-year migration off the legacy stack. Senior DEs are paid well to lead these.
- Hive metastore. Still a backing catalog under Spark in many shops; Iceberg can use it as a catalog implementation.
- Oozie. Rarely chosen for new work; maintained where it runs the last few legacy jobs.
- Pig. Effectively dead. Never start a new pipeline in Pig.
The "one chart" mental model.
Every job is one sentence: read from storage → transform on compute → write back to storage. The five-layer map tells you which component you picked for each verb. If you can defend each pick with a one-line "why," you have a designed architecture; if you cannot, you have a pile of tools.
Worked example — picking one component per layer for a new platform
Detailed explanation. A common architecture interview asks "design the data platform for a mid-stage fintech with 500 GB / day of transaction data, 20-minute freshness SLA, regulatory audit needs, and a team of six DEs." The expected answer is a five-layer pick list with justifications.
Question. Pick one component per layer for the workload above. Justify each choice in one sentence.
Input — workload.
| Dimension | Value |
|---|---|
| Volume / day | 500 GB |
| Freshness SLA | 20 minutes |
| Backfill | 7 years (regulator) |
| Team size | 6 DEs |
| Cloud | AWS |
| Audit need | Yes (PCI-DSS) |
Code (architecture pick list).
storage:
object_store: S3
table_format: Iceberg
file_format: Parquet
compute:
primary: Spark (EMR or Databricks)
streaming: Spark Structured Streaming (re-using Spark skills)
bi_layer: Trino (or Snowflake if budget allows)
ingestion:
streaming: Kafka (MSK managed)
cdc: Debezium → Kafka
batch: Airflow S3 sensor + Spark loader
orchestration:
primary: Airflow (MWAA managed)
sql_xform: dbt-core on top
governance:
catalog: Glue + Polaris (Iceberg REST)
lineage: OpenMetadata
pii: Lake Formation policies
Step-by-step explanation.
- Storage — S3 + Iceberg + Parquet is the lowest-cost, highest-flexibility base layer. Iceberg's time travel covers the 7-year audit requirement without hot warehouse storage.
- Compute — Spark for batch (well-known by the team), Spark Structured Streaming for the 20-minute SLA (re-uses Spark expertise), Trino for ad-hoc analyst queries against the lakehouse.
- Ingestion — Kafka (MSK) as the streaming backbone, Debezium for CDC off the transactional database, Airflow + Spark for batch loads from third-party APIs.
- Orchestration — Airflow on AWS MWAA for managed operations, dbt for SQL transformations on top of the lakehouse silver / gold layers.
- Governance — Glue as the metadata catalog (AWS-native), Polaris as the Iceberg REST catalog for cross-engine compatibility, OpenMetadata for lineage and discovery, Lake Formation for column-level PII policies.
Output.
| Layer | Pick | Why in one line |
|---|---|---|
| Storage | S3 + Iceberg + Parquet | cheapest base + time travel for audits |
| Compute | Spark + Trino | one engine for batch + streaming, one for BI |
| Ingestion | Kafka + Debezium | one streaming backbone, CDC for transactional sources |
| Orchestration | Airflow + dbt | managed orchestrator + SQL transform layer |
| Governance | Glue + Polaris + OpenMetadata | AWS-native catalog + open REST + lineage |
Rule of thumb. Pick one component per layer and defend it with one sentence. If you cannot defend the pick, you have not actually designed the architecture — you have collected a list of tools.
Worked example — when to swap a component without rewriting the rest
Detailed explanation. The five-layer model implies that components within a layer are interchangeable. The interesting question is: under what pressure do you actually swap? The answer is almost always cost or scale, never hype.
Question. Your platform is running Spark + Trino + Iceberg on S3 with Airflow. Costs are climbing because Spark cluster utilisation is low. What is the smallest swap that reduces cost without rewriting the platform?
Input — current cost breakdown.
| Component | Monthly cost | Utilisation |
|---|---|---|
| Spark EMR | $42_000 | 18% |
| Trino EMR | $9_000 | 60% |
| MSK Kafka | $6_000 | 45% |
| S3 + Iceberg | $4_000 | n/a |
| Airflow MWAA | $1_500 | n/a |
Code (pseudocode for the smallest-rewrite swap).
def recommend_swap(cost, util):
"""
The smallest-rewrite swap is the one that:
- touches only one layer
- keeps the table format and storage intact
- has a managed alternative with idle-time cost = 0
"""
if cost["Spark EMR"] > 30_000 and util["Spark EMR"] < 0.25:
return ("compute_layer",
"Move batch Spark from EMR to Databricks SQL Serverless "
"or Snowflake (auto-suspend on idle). Keep Iceberg tables, "
"keep Trino for ad-hoc, keep Airflow.")
return ("no_swap", "Costs are within range; tune queries first.")
print(recommend_swap(
cost={"Spark EMR": 42_000, "Trino EMR": 9_000, "MSK Kafka": 6_000},
util={"Spark EMR": 0.18, "Trino EMR": 0.60, "MSK Kafka": 0.45},
))
Step-by-step explanation.
- The Spark EMR cost ($42K / month) dominates the platform and runs at 18% utilisation — classic idle-time waste.
- The smallest-rewrite swap is within the compute layer only: move Spark from EMR (always-on cluster) to Databricks SQL Serverless or Snowflake (auto-suspend on idle).
- The storage layer (Iceberg on S3) stays unchanged — both alternatives speak Iceberg via the catalog.
- The orchestration (Airflow) stays unchanged — it just calls a different compute endpoint.
- The Trino layer stays unchanged for ad-hoc analyst queries — different engines for different read patterns is normal.
- The expected saving is roughly 50-70% of the Spark EMR cost on this workload (auto-suspend means you only pay for the active query minutes).
Output.
| Layer touched | Change | Expected saving |
|---|---|---|
| Compute (batch) | EMR Spark → managed serverless | $20K-$28K / month |
| Storage | unchanged | $0 |
| Ingestion | unchanged | $0 |
| Orchestration | unchanged | $0 |
| Governance | unchanged | $0 |
Rule of thumb. When cost or scale pressure hits one layer, swap within that layer before considering a multi-layer rewrite. The five-layer model is what makes this possible — components within a layer share interfaces (Iceberg files, Kafka topics, Airflow DAGs), so a single-layer swap is mostly a configuration change, not a re-architecture.
Big data interview question on stack architecture
A senior interviewer often frames this as: "Walk me through your team's data platform — what is at the storage layer, what is at the compute layer, what is at the ingestion layer, and how do you defend each pick? Where would you swap and where would you not?" The probe rewards candidates who can think in layers rather than in tools.
Solution Using the five-layer pick-list framework
def describe_platform(layers):
"""Print a defensible per-layer architecture description."""
order = ["storage", "compute", "ingestion", "orchestration", "governance"]
for layer in order:
pick = layers[layer]["pick"]
why = layers[layer]["why"]
swap_pressure = layers[layer]["swap_when"]
print(f"Layer: {layer:14s} → {pick}")
print(f" Why: {why}")
print(f" Swap when: {swap_pressure}")
print()
describe_platform({
"storage": {
"pick": "S3 + Iceberg + Parquet",
"why": "Cheap, durable, time travel for audits, open format.",
"swap_when": "Almost never — storage choice is sticky for 5+ years.",
},
"compute": {
"pick": "Spark (batch) + Flink (streaming) + Trino (BI)",
"why": "Each engine matches a latency tier; all read Iceberg.",
"swap_when": "Cost pressure on the dominant engine.",
},
"ingestion": {
"pick": "Kafka + Debezium",
"why": "Streaming backbone + CDC for transactional sources.",
"swap_when": "Operational cost of Kafka exceeds Kinesis on the cloud.",
},
"orchestration": {
"pick": "Airflow + dbt",
"why": "Mature scheduler + SQL transform layer with tests.",
"swap_when": "Asset-centric workflows are required (move to Dagster).",
},
"governance": {
"pick": "Glue + Polaris + OpenMetadata",
"why": "AWS-native catalog + open REST + lineage.",
"swap_when": "Multi-cloud parity is required (move to Unity Catalog).",
},
})
Step-by-step trace.
| Layer | Pick | Why | Swap pressure |
|---|---|---|---|
| storage | S3 + Iceberg | cheap, durable, time travel | almost never |
| compute | Spark + Flink + Trino | three latency tiers covered | cost on dominant engine |
| ingestion | Kafka + Debezium | streaming + CDC | Kafka ops cost |
| orchestration | Airflow + dbt | mature scheduler + SQL xform | asset-centric needs |
| governance | Glue + Polaris + OpenMetadata | AWS + open + lineage | multi-cloud parity |
The trace makes the architecture defensible row by row. An interviewer who hears this framework knows the candidate can think in layers, swap one component without re-architecting the rest, and defend choices against cost or scale pressure.
Output:
| Layer | Pick | Defended against |
|---|---|---|
| storage | S3 + Iceberg | cost, audit, multi-engine reads |
| compute | Spark + Flink + Trino | latency tiers, developer productivity |
| ingestion | Kafka + Debezium | replay, CDC, exactly-once |
| orchestration | Airflow + dbt | reliability, lineage, hire-for skill |
| governance | Glue + Polaris + OpenMetadata | regulators, PII, discovery |
Why this works — concept by concept:
- One pick per layer — every layer has exactly one component answering for it. Two competing components in the same layer is a sign of a half-finished migration.
- Each pick has a one-line why — if you cannot summarise the choice in one sentence, you copied it from a vendor's marketing page.
- Swap pressure named explicitly — the architecture is not "forever"; each layer has a documented swap trigger so the team knows when to revisit.
- Layer interfaces are stable — Iceberg files, Kafka topics, Airflow DAGs, catalog APIs. Stable interfaces are what makes single-layer swaps cheap.
- Cost — the framework is O(layers) to describe and O(1) to swap one component. The real cost is the discipline of saying "no" to second components in the same layer.
DE
Topic — optimization
Distributed query optimization problems
4. Batch vs streaming — Lambda is dead, Kappa rules
One streaming pipeline with exactly-once semantics + replay covers both real-time and batch use cases — that is why Lambda gave way to Kappa
The mental model in one line: Lambda solved the "I need real-time and batch" problem in 2012 by running two pipelines; Kappa solved the duplication problem in 2020 by running one streaming pipeline and replaying the topic for backfills. Once you can state when batch still wins in 2026 (cost, simplicity, ML training sets, finance close), you have the full architectural debate at your fingertips.
The classical Lambda architecture (2012).
- Batch layer. Authoritative historical view. Spark or MapReduce reads from HDFS / S3, computes "the truth," writes to a Hive table. Updated nightly.
- Speed layer. Real-time view. Storm or Spark Streaming reads from Kafka, computes "the recent delta," writes to a fast store (HBase, Cassandra).
- Serving layer. Merges the two views and serves the union to BI dashboards.
- Why it was invented. Streaming engines in 2012 could not give you exactly-once semantics. The batch layer existed to correct the speed layer's approximations.
- Why it died. Two codebases — one in Spark batch, one in Spark Streaming — implementing the same business logic. They drifted, the merge layer was fragile, and the cost of maintaining both was painful.
Kappa architecture (2020 reference).
- One stream, one pipeline. Kafka holds the canonical replayable log. Flink or Spark Structured Streaming reads it once; writes to Iceberg / Delta with ACID commits.
- Backfill = replay. Need to re-process two years of history? Reset the consumer offset to 0 and let the streaming job re-emit the same outputs.
- Why it works. Modern streaming engines deliver exactly-once semantics out of the box, so the "speed layer needs batch correction" problem disappears.
- What it requires. Idempotent sinks, checkpointed state, watermarks for late data. Three primitives, all standard in Flink and Spark Structured Streaming since 2019.
The modern reference architecture.
Sources (clickstream, CDC, logs) → Kafka (replayable log)
→ Flink / Spark Structured Streaming (transform)
→ Iceberg / Delta (ACID lakehouse table)
→ BI dashboards (Trino / managed warehouse)
→ ML feature stores
→ Real-time serving (Pinot / Druid / Redis)
Exactly-once semantics — three primitives, all required.
- Idempotent sinks. Re-emitting the same event must produce the same outcome. Iceberg / Delta provide this via primary-key merge semantics.
- Checkpoints. The engine periodically writes its state and source offsets atomically. On crash, it resumes from the last checkpoint and reads the same events without duplication.
- Watermarks. The engine tracks "event time" vs "processing time" and decides when a window is complete enough to emit. Late events that arrive after the watermark either go into a late-data pane or are dropped — explicit, not surprising.
Latency tiers — pick the engine by SLA, not by hype.
| Latency SLA | Engine | Use case |
|---|---|---|
| < 1 second | Flink (native streaming) | Real-time bidding, fraud detection, live alerts |
| Seconds (1-10s) | Spark Structured Streaming | Dashboards, near-real-time enrichment |
| Minutes (1-15m) | Spark micro-batch | Streaming ETL with cost over latency |
| Hours (1h+) | Spark batch / dbt incremental | Reporting, ML training sets, finance close |
Backfill story — replay vs separate batch.
The Lambda answer to backfill was "run the batch layer on the historical data." The Kappa answer is "rewind the Kafka consumer." The Kappa story is dramatically simpler in practice:
- One codebase for both real-time and backfill.
- One set of tests that catch both regressions.
- One deployment process for both modes.
The cost is that Kafka retention must cover the longest plausible backfill window — usually 7 to 30 days for warm replay. Beyond that, you replay from an archived copy of the topic in S3 (which is essentially a batch read, but through the same code path).
When batch still wins in 2026.
Despite Kappa's elegance, batch jobs still dominate in four real situations:
- Cost. A nightly Spark job on EMR is dramatically cheaper than a 24×7 Flink cluster for use cases where latency does not matter.
- Simplicity. A junior engineer can ship a batch ETL in a week; a senior engineer needs three weeks to ship a robust streaming equivalent.
- ML training sets. Models train on stable historical snapshots, not on the live stream. A nightly batch job is the right shape for this.
- Finance close. Month-end financial closes need a frozen, audited snapshot — exactly what a batch run produces. Streaming makes the audit story harder, not easier.
Worked example — picking the right engine for a given latency tier
Detailed explanation. A common interview probe: given four downstream consumers with four different latency SLAs, design the engine pick for each. The expected answer matches engine to SLA — and explicitly avoids "use streaming for everything."
Question. A platform has four downstream consumers: (1) a live fraud-detection model with < 1s SLA; (2) a dashboard with 10s SLA; (3) an analyst-facing BI report with 15-minute SLA; (4) a nightly ML training pipeline. Map each to its engine.
Input — consumer SLAs.
| Consumer | SLA | Volume |
|---|---|---|
| Fraud detection | < 1 second | 5K events / sec |
| Live dashboard | < 10 seconds | 100K events / hour |
| BI report | 15 minutes | 50 GB / hour |
| ML training | 1 nightly run | 500 GB / day |
Code.
def pick_engine(sla_seconds):
if sla_seconds < 1:
return "Flink (native streaming, exactly-once)"
if sla_seconds <= 10:
return "Spark Structured Streaming (micro-batch, exactly-once)"
if sla_seconds <= 900: # 15 minutes
return "Spark micro-batch / dbt incremental"
return "Spark batch (nightly / hourly)"
for name, sla in [("fraud", 0.5), ("dashboard", 10),
("bi_report", 900), ("ml_training", 86_400)]:
print(f"{name:12s} | SLA={sla:>5d}s → {pick_engine(sla)}")
Step-by-step explanation.
- Fraud detection at < 1s SLA → Flink. No other engine gives reliable sub-second latency at exactly-once semantics.
- Live dashboard at 10s SLA → Spark Structured Streaming. The micro-batch interval can be set to 5s and comfortably meet the 10s SLA. Re-uses Spark expertise.
- BI report at 15-minute SLA → Spark micro-batch or dbt incremental. Same Spark cluster as the dashboard, just a different schedule.
- ML training at nightly cadence → Spark batch. Stable snapshot of the lakehouse table, no streaming overhead.
Output.
| Consumer | Engine | Reason |
|---|---|---|
| Fraud detection | Flink | sub-second latency required |
| Live dashboard | Spark Structured Streaming | 10s SLA, fits micro-batch |
| BI report | Spark micro-batch / dbt | 15-min SLA, cost > latency |
| ML training | Spark batch | nightly stable snapshot |
Rule of thumb. Pick the engine by the SLA, not by the hype. Use Flink only when the SLA requires sub-second; use Spark Structured Streaming when seconds are enough; use batch when the SLA is measured in minutes or longer. Mixing engines is normal — most mid-stage platforms run three at the same time.
Worked example — exactly-once with idempotent sink + checkpoint + watermark
Detailed explanation. "Exactly-once semantics" is the marketing line; the engineering reality is three primitives that must all be present. Interviewers love to probe this because it forces the candidate to explain the actual mechanism, not just the slogan.
Question. Walk through a Spark Structured Streaming job that reads from Kafka, enriches with a lookup table, and writes to an Iceberg sink with exactly-once semantics. Name each primitive that makes it correct.
Input — Kafka topic schema.
| field | type |
|---|---|
| event_id | UUID (primary key) |
| event_ts | TIMESTAMP |
| user_id | LONG |
| amount | DECIMAL |
Code.
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder \
.appName("orders-streaming") \
.config("spark.sql.streaming.checkpointLocation", "s3://chk/orders/") \
.getOrCreate()
raw = (spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka:9092")
.option("subscribe", "orders.v1")
.option("startingOffsets", "latest")
.load())
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
.select(F.from_json("json", schema).alias("e"))
.select("e.*")
.withWatermark("event_ts", "10 minutes"))
# Enrich with broadcast lookup
users = spark.table("lake.silver.users")
enriched = parsed.join(F.broadcast(users), "user_id", "left")
(enriched.writeStream
.format("iceberg")
.outputMode("append")
.option("path", "lake.silver.orders_enriched")
.option("fanout-enabled", "true")
.trigger(processingTime="5 seconds")
.start())
Step-by-step explanation.
-
Checkpoint location —
spark.sql.streaming.checkpointLocation = "s3://chk/orders/". After each micro-batch, Spark atomically writes the source offsets and the state into S3. On restart, it reads the last successful checkpoint and resumes exactly there — no events skipped, no duplicates. -
Watermark —
withWatermark("event_ts", "10 minutes")declares that events more than 10 minutes late may be dropped. This bounds the state size and lets the engine close windows safely. - Idempotent sink — Iceberg's append + merge semantics dedupe by primary key. Even if Spark re-emits the same event on recovery, the Iceberg sink resolves to a single committed row.
-
Broadcast join — the small
userstable fits in executor memory and avoids a shuffle. Streaming joins to large tables either need state stores or a different design. -
Trigger interval —
processingTime="5 seconds"runs a micro-batch every 5 seconds. The SLA for downstream readers is bounded by this interval plus the processing time.
Output (sample enriched stream snapshot).
| event_id | event_ts | user_id | name | amount |
|---|---|---|---|---|
| 1f...c4 | 2026-06-05 12:00:01 | 42 | Alice | 120.00 |
| 9a...e2 | 2026-06-05 12:00:02 | 17 | Bob | 80.00 |
| 4d...07 | 2026-06-05 12:00:03 | 42 | Alice | 35.00 |
Rule of thumb. Any "we have exactly-once semantics" claim that does not name a checkpoint location, a watermark, and an idempotent sink is marketing, not engineering. Make those three primitives visible in code review.
Worked example — backfill via Kafka replay vs separate batch
Detailed explanation. A senior interviewer probes the Kappa promise: "Your streaming pipeline has been broken for a week and you need to re-process every event. Walk me through doing this without writing a separate batch job."
Question. Explain how to replay a Kafka topic from a specific offset and re-process through the same Spark Structured Streaming job that runs in production. What guarantees does this give you, and what is the cost?
Input — operational scenario.
| Detail | Value |
|---|---|
| Topic | orders.v1 |
| Partitions | 12 |
| Retention | 14 days |
| Broken window | 7 days |
| Sink | Iceberg lake.silver.orders_enriched
|
Code.
# Step 1 — stop the live consumer group
# (kafka admin) reset offsets to 7 days ago
# kafka-consumer-groups.sh --reset-offsets --to-datetime 2026-05-29T00:00:00 \
# --group orders-streaming --topic orders.v1 --execute
# Step 2 — start the streaming job with a new checkpoint dir
(spark.readStream
.format("kafka")
.option("subscribe", "orders.v1")
.option("startingOffsets", "earliest") # honour reset offsets
.load()
# ... same transformation as production ...
.writeStream
.format("iceberg")
.option("checkpointLocation", "s3://chk/orders-backfill/") # separate dir
.outputMode("append")
.trigger(availableNow=True) # run-to-completion mode
.start())
Step-by-step explanation.
- Reset the consumer group offsets to the start of the broken window (7 days ago). Kafka's retention (14 days) covers the replay window.
- Start the same Spark Structured Streaming job with a new checkpoint directory (so it does not collide with the live job) and
trigger(availableNow=True)so it processes the backlog and exits. - The Iceberg sink dedupes by primary key, so any rows that were successfully written before the break are not duplicated — they merge to the same key.
- The cost is roughly 7 × the per-day cost of the live job, but you avoid writing and maintaining a separate batch path.
- When the replay finishes, restart the live consumer with
startingOffsets = "latest"to resume normal operation.
Output.
| Metric | Before backfill | After backfill |
|---|---|---|
| Rows missing in window | ~1.2M | 0 |
| Distinct codebases | 1 (streaming) | 1 (streaming) |
| Operational steps | n/a | offset reset + one-shot run |
Rule of thumb. The Kappa promise is "one codebase serves both real-time and backfill." Make Kafka retention long enough to cover your worst-case backfill scenario (usually 14-30 days for warm replay; archive to S3 beyond that), and your team will never write a parallel batch implementation again.
Big data interview question on streaming architecture
A senior interviewer often frames this as: "Defend Kappa over Lambda for a real platform. Where does Kappa break down, and when would you still keep a batch path?" The probe rewards candidates who can both endorse Kappa and name its honest limits.
Solution Using the Kappa + selective batch hybrid
def architecture_choice(workload):
"""
Returns Kappa for hot paths, batch for cost-sensitive cold paths.
"""
sla = workload["sla_seconds"]
backfill_years = workload["backfill_years"]
cost_sensitive = workload["cost_sensitive"]
audit_freeze = workload["audit_freeze"]
if sla <= 30:
path = "kappa_streaming"
elif cost_sensitive or backfill_years > 3 or audit_freeze:
path = "batch"
else:
path = "kappa_micro_batch"
return path
print(architecture_choice({
"sla_seconds": 5,
"backfill_years": 7,
"cost_sensitive": True,
"audit_freeze": True,
})) # → 'kappa_streaming' (SLA dominates)
Step-by-step trace.
| Workload signal | Branch | Result |
|---|---|---|
| SLA ≤ 30s | streaming branch | kappa_streaming |
| SLA > 30s + cost or audit | batch branch | batch |
| Everything else | default | kappa_micro_batch |
The trace shows that SLA dominates the streaming-vs-batch choice. Once a workload has a sub-30-second SLA, the rest of the signals (cost, backfill, audit) influence the sink choice (write to an audited frozen snapshot vs the live table) but not the engine choice.
Output:
| Hot path (Kappa) | Cold path (batch) |
|---|---|
| Sources → Kafka → Flink / Spark Structured Streaming → Iceberg (live) | Iceberg (live) → nightly Spark snapshot → Iceberg (frozen audit table) |
Why this works — concept by concept:
- Kappa for the hot path — one streaming pipeline + Kafka replay covers both real-time and most backfills, with one codebase. This is the modern default.
- Selective batch for cold paths — finance closes and ML training need a frozen, audited snapshot. A nightly Spark job on top of the same Iceberg table produces that snapshot at near-zero extra engineering cost.
- Exactly-once is non-negotiable — every streaming sink must dedupe by primary key (Iceberg merge, Delta merge, or a downstream UPSERT). Without it, replay introduces duplicates.
- Watermarks bound state — without a watermark, the streaming engine grows its state indefinitely and eventually OOMs. Always declare one.
- Cost — Kappa pays the always-on streaming cost (proportional to the SLA) plus the lakehouse storage cost. Batch adds a marginal nightly compute cost for the cold-path frozen snapshot.
DE
Topic — streaming
Streaming pipeline problems
5. The 6-month learning ladder for 2026
Six months, six skill steps, one portfolio repo — the minimum viable 2026 big data engineer takes 24 focused weeks to build, not three years
The mental model in one line: the 2026 big data engineer is built on SQL fluency, PySpark + Spark SQL, Kafka, the lakehouse pattern, orchestration + CI/CD, and a capstone — in that order. Once you can defend the order ("SQL first because it is the universal interface; Spark second because it is the universal compute engine"), every "how should I learn big data?" question collapses to a single roadmap.
Month 1 — SQL mastery.
-
Topics. Window functions (RANK, ROW_NUMBER, LAG / LEAD, running totals), CTEs (including recursive), query planning (
EXPLAIN,EXPLAIN ANALYZE), set operations (UNION / INTERSECT / EXCEPT), DDL fluency, NULL handling. - Tools. Postgres locally, BigQuery sandbox or Snowflake free tier.
- Outcome. Senior-level SQL fluency. You can read a 200-line query in one pass, name the bottleneck in the EXPLAIN plan, and rewrite it to half the cost.
- Why first. SQL is the universal interface for every layer of the stack. Skipping it guarantees that you write bad PySpark and miss every analyst conversation.
Month 2 — PySpark + Spark SQL.
- Topics. DataFrame API, partitioning, broadcast joins, AQE (adaptive query execution), Catalyst optimiser, the shuffle, executor / driver memory tuning, the partition lifecycle.
- Tools. Local Spark via Docker, then a free Databricks Community Edition account or AWS EMR.
- Outcome. Tune a 1 TB Spark job. Know when to broadcast, when to repartition, when to cache, and when to leave it alone.
- Why second. Spark is the universal compute engine. Almost every big data interview probes Spark internals.
Month 3 — Kafka fundamentals.
- Topics. Topics, partitions, consumer groups, exactly-once semantics, producer / consumer APIs, Schema Registry, consumer offset management, replay.
- Tools. Local Kafka via Docker Compose, then a managed cluster (MSK, Confluent Cloud free tier, or Aiven).
- Outcome. Build a producer + consumer in Python. Reset consumer offsets to replay a window. Explain why keys go to the same partition.
- Why third. Kafka is the streaming substrate; learning the lakehouse without Kafka leaves the streaming half of the stack invisible.
Month 4 — Lakehouse + Iceberg / Delta.
- Topics. ACID on object storage, time travel, schema evolution, hidden partitioning, CDC merge patterns, table optimisation (compaction, snapshot expiration), catalog operations.
- Tools. Spark + Iceberg locally, then Databricks (Delta) or AWS Glue + Iceberg.
- Outcome. Ship an Iceberg table with CDC merge. Time-travel back to a snapshot. Explain hidden partitioning to a sceptical analyst.
- Why fourth. The lakehouse is what ties storage, compute, and streaming together; you cannot defend the 2026 stack without it.
Month 5 — Orchestration + CI/CD for data.
- Topics. Airflow DAG design, dbt models + tests + docs, GitHub Actions for CI, Great Expectations or Soda for data quality, secrets management, environment promotion (dev → staging → prod).
- Tools. Local Airflow via Astronomer CLI, dbt-core, GitHub Actions, Great Expectations.
- Outcome. Production-grade dbt project with tests, CI, and lineage. An Airflow DAG that retries, alerts, and reports SLA misses.
- Why fifth. Orchestration is where reliability lives. Junior engineers ship pipelines; senior engineers ship pipelines that recover.
Month 6 — Capstone.
- Topics. End-to-end streaming + batch pipeline on AWS / GCP. Cost monitoring. On-call runbook. System design write-up. Public portfolio repo.
- Tools. Whatever your target employer uses — usually AWS (S3, MSK, EMR, MWAA, Glue) or GCP (GCS, Pub/Sub, Dataproc, Composer, BigQuery).
- Outcome. A repo with a streaming-to-lakehouse pipeline, batch reporting jobs, CI/CD, a cost dashboard, and a written architecture doc. This is the artefact you point hiring teams at.
- Why last. The capstone is the proof. Without it, the previous five months are just a learning log; with it, you have evidence.
What to skip in 2026.
- Pure Hadoop MapReduce. Use the time to learn Spark instead.
- On-prem HDFS deployment. The job market for HDFS administrators is small and shrinking.
- Raw Hive QL. Learn Spark SQL or Trino SQL instead — both supersets of Hive QL.
- Pig. Effectively dead.
- Oozie. Maintained, not chosen for new work.
- Storm. Replaced by Flink and Spark Structured Streaming.
- Tez. Niche, mostly Hive-internal.
Worked example — first month's SQL drill plan
Detailed explanation. The single biggest predictor of success in months 2-6 is SQL fluency. New big data engineers who try to skip Month 1 spend the next five months writing PySpark that reads like procedural Python — slow, ugly, untunable. A focused four-week SQL plan compounds.
Question. Design a 4-week SQL drill plan for a learner entering the roadmap. Allocate week-by-week themes and one capstone problem per week.
Input — learner profile.
| Dimension | Value |
|---|---|
| Hours / day | 1.5 |
| Days / week | 5 |
| Background | basic SELECT / WHERE / GROUP BY |
| Target | senior-level fluency by end of week 4 |
Code (drill schedule).
plan = {
"Week 1 — joins & aggregation": [
"INNER / LEFT / FULL OUTER JOIN",
"Self-joins for hierarchies",
"GROUP BY + HAVING with NULL",
"Capstone: monthly active users from event stream",
],
"Week 2 — window functions": [
"ROW_NUMBER / RANK / DENSE_RANK",
"LAG / LEAD",
"Running totals via PARTITION BY ORDER BY",
"Capstone: top-3 products per category by revenue",
],
"Week 3 — CTEs & query planning": [
"Recursive CTEs (org charts, manager trees)",
"EXPLAIN ANALYZE on a 5-join query",
"Index design + selectivity intuition",
"Capstone: rewrite a 200-line query to half the cost",
],
"Week 4 — NULL handling & set ops": [
"IS NULL vs = NULL, COALESCE, NULLIF",
"UNION / INTERSECT / EXCEPT semantics",
"Three-valued logic in WHERE / ON / HAVING",
"Capstone: snapshot diff between two days using FULL OUTER JOIN",
],
}
for week, topics in plan.items():
print(week)
for t in topics:
print(f" - {t}")
print()
Step-by-step explanation.
- Week 1 covers the JOIN family because every downstream concept assumes JOIN fluency. The capstone (MAU from events) forces a self-join or window-function deduplication.
- Week 2 introduces window functions because they unlock the "per-group ranked" family of problems that dominate interview SQL.
- Week 3 makes the learner read execution plans — the single most senior-distinguishing skill. The capstone (rewrite a slow query) builds the intuition of cost-aware SQL.
- Week 4 closes the NULL gap (responsible for ~40% of silent production bugs) and the set-ops surface (UNION / INTERSECT / EXCEPT and the NULL-safe FULL OUTER JOIN diff pattern).
Output.
| Week | Theme | Capstone problem |
|---|---|---|
| 1 | JOIN + aggregation | Monthly active users |
| 2 | Window functions | Top-3 per category |
| 3 | CTEs + query plans | Cost-halve a 200-line query |
| 4 | NULL + set ops | Snapshot diff via FULL OUTER JOIN |
Rule of thumb. Compress the SQL month into 4 focused weeks with one capstone per week. By the end, you should be able to read any production query in one pass and name the bottleneck — that is the fluency bar before you touch Spark.
Worked example — Month 2 PySpark drill — a tunable 1 TB job
Detailed explanation. The Month 2 capstone is a 1 TB join. The point is not the data volume; the point is the tuning loop — you watch the Spark UI, identify a skew, fix it, re-run, and feel the cost drop.
Question. Design a Month 2 PySpark drill that takes a 1 TB orders table joined to a 10 MB users table and walks the learner through the broadcast join optimisation.
Input — drill setup.
| Table | Size | Note |
|---|---|---|
| orders | 1 TB | partitioned by event_date |
| users | 10 MB | small lookup |
Code.
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder \
.appName("month2-drill") \
.config("spark.sql.adaptive.enabled", "true") \
.config("spark.sql.adaptive.skewJoin.enabled", "true") \
.getOrCreate()
orders = spark.read.table("lake.bronze.orders").where(
F.col("event_date") >= F.current_date() - F.expr("INTERVAL 30 DAYS")
)
users = spark.read.table("lake.silver.users")
# Step A — naive shuffle join (the baseline)
naive = orders.join(users, "user_id", "left") \
.groupBy("country").agg(F.sum("amount").alias("revenue"))
# Step B — broadcast join (the optimised version)
fast = orders.join(F.broadcast(users), "user_id", "left") \
.groupBy("country").agg(F.sum("amount").alias("revenue"))
# The learner observes:
# - Naive plan: full shuffle of 1 TB across all executors
# - Broadcast plan: users table sent to every executor, no shuffle on orders
# - Wall-clock typically 10-100x faster on the broadcast plan
Step-by-step explanation.
- The naive join shuffles 1 TB of orders by
user_idso that matching keys land on the same executor. This is the dominant cost. - The broadcast join sends the 10 MB users table to every executor instead, so the orders side does not shuffle at all. The 1 TB stays in place.
- AQE (
spark.sql.adaptive.enabled) lets Spark switch some shuffle joins to broadcast joins automatically when the planner sees the build side is small enough at runtime. - Skew handling (
spark.sql.adaptive.skewJoin.enabled) splits a single hot partition into many sub-partitions so the slowest executor does not block the whole stage. - The drill teaches the learner to read the Spark UI — find the stage that took the longest, find the partition that took the longest in that stage, and decide whether the fix is broadcast, repartition, or skew handling.
Output (illustrative wall-clock).
| Plan | Shuffle on orders | Wall clock |
|---|---|---|
| naive join | 1 TB | 32 min |
| broadcast join | 0 | 2 min 40 s |
| broadcast + AQE | 0 | 2 min 10 s |
Rule of thumb. Make Month 2 a tuning loop, not a tutorial. Pick one job, watch the Spark UI, fix one thing per run, and feel the cost drop. The intuition you build in those 20 runs is what makes you employable in Month 6.
Worked example — Month 6 capstone scope sketch
Detailed explanation. The capstone is the artefact you point hiring managers at. Scope it small enough to finish in 4 weeks, large enough to demonstrate every layer of the stack.
Question. Design a Month 6 capstone scope that touches all five layers (storage, compute, ingestion, orchestration, governance) and lands as a public GitHub repo with a system-design write-up.
Input — capstone constraints.
| Constraint | Value |
|---|---|
| Hours / week | 10 |
| Weeks | 4 |
| Budget | < $200 in cloud credit |
| Cloud | AWS free tier + small EMR cluster |
Code (capstone scope sketch).
project: nyc-taxi-streaming-lakehouse
storage:
bucket: s3://<your-bucket>/nyc-taxi/
table_format: Iceberg via AWS Glue catalog
compute:
batch: Spark on EMR (3 nodes, m5.xlarge)
streaming: Spark Structured Streaming
bi: Trino on EMR
ingestion:
source: NYC taxi open data dropped into Kafka via a synthetic producer
cdc: Debezium against a small Postgres of dim_zones
orchestration:
airflow_mwaa: small environment
dbt_core: silver / gold layer transforms
governance:
glue_catalog: Iceberg tables registered
openmetadata: free tier for lineage
deliverables:
- public github repo with code + iac
- architecture diagram (the five-layer map)
- cost dashboard screenshot
- on-call runbook (one page)
- blog post explaining design choices
Step-by-step explanation.
- The data set (NYC taxi) is public, large enough to feel real, and small enough to run on a few EMR nodes within budget.
- The streaming half (synthetic producer → Kafka → Spark Structured Streaming → Iceberg) demonstrates the modern reference architecture.
- The batch half (dbt models on top of the silver / gold layers) demonstrates SQL transform reliability.
- The orchestration + governance touch all five layers and produce the artefacts hiring teams care about: lineage, cost, and a runbook.
- The blog post + architecture diagram are the single most-read artefacts from a portfolio repo — invest more in them than in the code.
Output (artefact checklist).
| Artefact | Audience | Purpose |
|---|---|---|
| GitHub repo | engineers | code review proxy |
| Architecture diagram | hiring managers | "do they think in layers?" |
| Cost dashboard | finance-aware leaders | cost awareness signal |
| On-call runbook | SRE-leaning teams | operational maturity signal |
| Blog post | recruiters | search-engine + "do they communicate?" signal |
Rule of thumb. Treat the Month 6 capstone as a portfolio artefact, not a learning exercise. Polish the architecture diagram and the README more than the code itself — those are what get clicked first.
Big data interview question on the learning roadmap
A senior interviewer often probes: "If you were teaching a junior engineer big data from scratch in 2026, what order would you pick, and what would you tell them to skip?" The probe rewards candidates who can defend an opinionated order and call out outdated technologies without flinching.
Solution Using the SQL-first, Spark-second, capstone-last order
def roadmap_for(month):
plan = {
1: ("SQL mastery", "window funcs, CTEs, EXPLAIN, NULLs"),
2: ("PySpark + Spark SQL",
"DataFrame API, partitioning, broadcast, AQE"),
3: ("Kafka fundamentals",
"topics, partitions, consumer groups, exactly-once"),
4: ("Lakehouse + Iceberg / Delta",
"ACID, time travel, schema evolution, CDC merge"),
5: ("Orchestration + CI/CD",
"Airflow, dbt, GitHub Actions, Great Expectations"),
6: ("Capstone end-to-end",
"streaming + batch + cost dashboard + on-call runbook"),
}
return plan[month]
skip_in_2026 = [
"Pure Hadoop MapReduce",
"On-prem HDFS deployment",
"Raw Hive QL (use Spark SQL or Trino instead)",
"Pig",
"Oozie",
"Storm",
]
for m in range(1, 7):
title, body = roadmap_for(m)
print(f"Month {m}: {title:32s} | {body}")
print("\nSkip in 2026:", ", ".join(skip_in_2026))
Step-by-step trace.
| Month | Skill | Outcome chip |
|---|---|---|
| 1 | SQL mastery | "Senior-level fluency" |
| 2 | PySpark + Spark SQL | "Tune a 1 TB Spark job" |
| 3 | Kafka fundamentals | "Build a producer + consumer" |
| 4 | Lakehouse + Iceberg | "Ship an Iceberg CDC merge" |
| 5 | Orchestration + CI/CD | "Production-grade dbt project" |
| 6 | Capstone | "Portfolio repo + system design write-up" |
The trace makes the order defensible: each month builds on the previous one, and each closes with a measurable outcome chip. The "skip in 2026" list calls out the technologies that hiring teams no longer test, freeing the learner to focus on the ones they do.
Output:
| Order delivered | Skip list |
|---|---|
| SQL → Spark → Kafka → Lakehouse → Orchestration → Capstone | MapReduce, HDFS on-prem, Pig, Oozie, Storm, Tez |
Why this works — concept by concept:
- SQL first — universal interface to every engine and every analyst. Skipping it guarantees procedural-shaped PySpark.
- Spark second — universal compute engine. Most interview probes assume Spark fluency.
- Kafka third — the streaming substrate. Cannot defend the modern stack without it.
- Lakehouse fourth — ties storage + streaming + compute together with ACID, time travel, and schema evolution.
- Orchestration + CI/CD fifth — reliability layer. Junior engineers ship pipelines; senior engineers ship pipelines that recover.
- Capstone last — proof. The artefact you point hiring managers at.
- Cost — 6 months × 10 hours / week = ~240 hours. That is the minimum-viable investment to be employable as a 2026 big data engineer.
SQL
Topic — window functions
Window function problems for big data
Cheat sheet — big data recipes
-
Right-size a Spark cluster. 4 GB RAM per core, 2-4 cores per executor, and
parallelism ≈ executors × cores × 2. Start with these defaults; tune from the Spark UI, not from blog posts. -
Always partition lakehouse tables by ingestion date. Use
event_dateoringest_dateas the partition column unless you have a stronger access pattern. Iceberg's hidden partitioning handles the predicate pushdown. -
Broadcast join when one side is < 100 MB. 10-100× faster than the equivalent shuffle join. Enable
spark.sql.autoBroadcastJoinThresholdat 10 MB and let AQE upgrade larger broadcasts at runtime. - Kafka keys go to the same partition. Design your key for the access pattern, not the producer. If consumers process per-user state, key by user_id; if consumers shard work, key by a hash that distributes evenly.
-
Iceberg / Delta time travel covers every audit, GDPR, and replay question. Make
TIMESTAMP AS OFpart of your runbook. - Streaming exactly-once = idempotent sink + checkpointing + watermarks — all three or none. If any one of the three is missing, you do not have exactly-once; you have at-least-once with extra steps.
- "Just store it in S3 as Parquet" beats 80% of bespoke Hadoop pipelines in 2026. When in doubt, write Parquet to S3 partitioned by date and read with Trino. Optimise only when the SLA forces it.
- dbt for SQL transforms, Spark for everything else. dbt is faster to ship and easier to test for SQL-shaped work; Spark dominates when the transform needs Python, ML, or large joins outside the warehouse.
- Schema Registry is not optional. Run a Schema Registry (Confluent or Apicurio) from day one. The day someone changes a Kafka topic schema without it is the day your downstream pipeline silently corrupts a quarter of metrics.
-
Compact small files weekly. Iceberg's
rewrite_data_filesprocedure (or Delta'sOPTIMIZE) merges tiny files into ~256 MB blocks. Skip this and your query times degrade quietly over months. -
Snapshot expiration as a scheduled job. Iceberg keeps every snapshot until you tell it not to. Run
expire_snapshotsweekly to keep the metadata footprint sane. -
CDC merge pattern is one MERGE statement.
MERGE INTO target USING source ON target.pk = source.pk WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT. Memorise the shape — every streaming → lakehouse pipeline ends with it. - Cost monitoring is non-optional in 2026. Run a daily cost dashboard from day one (CUR for AWS, Billing Export for GCP). The team that ships first wins; the team that surfaces cost first stays in budget.
Frequently asked questions
Is Hadoop dead in 2026?
Hadoop as a deployment (HDFS + MapReduce + YARN on commodity hardware) is largely legacy — most teams are mid-migration off CDH or HDP onto S3 + Spark + Iceberg. Hadoop as a conceptual ancestor is alive and well: Spark inherited its partition / shuffle / executor mental model from Hadoop; Iceberg's spec descends from Hive's table abstraction; even Snowflake's micro-partitions are HDFS blocks rebranded. Skip the deployment, learn the ideas — they explain every optimisation choice in modern engines. The job market for "Hadoop administrator" is small and shrinking; the job market for "data engineer who can debug a Spark stage" is enormous.
Spark vs Flink — which should I learn first?
Learn Spark first. It is the universal batch + micro-batch engine, with the largest hiring footprint, the most mature ecosystem (PySpark, Spark SQL, Spark MLlib, Databricks), and the gentlest learning curve via DataFrames. Flink is the better engine for sub-second exactly-once streaming, but its API surface is more demanding and the job market is narrower (heavy in ad tech, fintech, and IoT). Once you can ship a tuned 1 TB Spark job, picking up Flink in a month for streaming-specific work is straightforward. Reversing the order (Flink first, then Spark) leaves you blind to ~70% of the typical hiring pipeline.
Do I need to learn Java / Scala for Spark, or is PySpark enough?
PySpark is enough for 95% of jobs in 2026, including most FAANG and FAANG-adjacent roles. The DataFrame API + Spark SQL surface is identical across Python, Scala, and Java; the optimiser (Catalyst) and execution engine (Tungsten) are language-agnostic. You only need Scala or Java when you write custom UDFs that must run inside the JVM for performance, contribute to Spark itself, or maintain a legacy Scala codebase. Start with PySpark, learn enough Scala to read other people's code (a weekend), and only invest deeply if your specific team's stack demands it.
What is a lakehouse and how is it different from a data lake?
A data lake is just files on object storage — Parquet on S3, no transactions, no schema enforcement, no time travel. A lakehouse is a data lake plus an open table format (Iceberg, Delta, Hudi) that adds ACID commits, schema evolution, time travel, hidden partitioning, and a catalog. The key shift is that the table format catalog — not the file system — becomes the source of truth. The practical difference: a lakehouse table can be safely written by multiple concurrent jobs, queried at a historical point in time, and read by Spark, Flink, Trino, and Snowflake from the same underlying files. A data lake offers none of those guarantees.
How much does a real big-data stack cost on AWS / GCP per month?
The honest answer is "$5K to $5M depending on data volume, latency SLA, and whether you choose managed services or roll your own." A mid-stage fintech with 500 GB / day, 20-minute SLA, and 50 concurrent readers usually lands in the $30K-$80K / month range using AWS managed services (S3 + Glue + EMR Spark + MSK Kafka + MWAA Airflow). Cost-aware teams cut that by 30-50% by moving idle batch workloads to serverless options (Databricks SQL Serverless, Snowflake auto-suspend) and by tightening Kafka retention. The single biggest cost dominator is usually compute on always-on clusters; the second is Snowflake or BigQuery scan costs if BI usage is unbounded.
Is "big data engineer" a separate job title from "data engineer" in 2026?
Increasingly, no. Most 2026 job postings use "data engineer" as the umbrella title and expect candidates to handle the full stack — SQL + Spark + Kafka + lakehouse + orchestration. Where you still see "big data engineer" as a distinct title, it usually means a specialist who lives in the streaming / distributed-systems half of the stack (Flink, Kafka exactly-once, custom Spark internals) — typically at FAANG, ad tech, fintech, or IoT companies with very high data velocity. The career advice is unchanged: build the full-stack data engineer profile first; specialise into "big data engineer" only after you have shipped two or three full lakehouse + streaming pipelines.
Practice on PipeCode
- Drill the ETL practice library → for end-to-end pipeline design and the batch-vs-streaming trade-off.
- Rehearse on streaming problems → for the Kafka / Flink / Spark Structured Streaming surface and exactly-once semantics.
- Sharpen aggregation problems → for the big-table GROUP BY / window-function family that dominates Spark SQL interviews.
- Stack the optimization library → for distributed query tuning, broadcast joins, partition pruning, and skew handling.
- Layer the joins practice library → for shuffle vs broadcast vs sort-merge join choices.
- For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the engine axis with the Apache Spark internals course → and the PySpark fundamentals course →.
- Round out the architecture axis with the ETL system design course → and the data modelling course →.
Pipecode.ai is Leetcode for Data Engineering — every layer in this 2026 big data roadmap ships with hands-on practice rooms where you write the PySpark broadcast join, the Kafka producer, the Iceberg CDC merge, and the Flink exactly-once sink against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so by the time you finish the 6-month ladder you have not just read about `big data engineering` — you have shipped it.





Top comments (0)