DEV Community

Cover image for Lambda vs Kappa Architecture: When Each Wins in 2026
Gowtham Potureddi
Gowtham Potureddi

Posted on

Lambda vs Kappa Architecture: When Each Wins in 2026

lambda architecture is one of those phrases that signals a decade of data-platform history in a single keyword — invented in 2011 by Nathan Marz for the Twitter / BackType firehose, codified as "batch layer authoritative, speed layer approximate, serving layer merges both," and quietly inherited by every batch-plus-streaming data platform built in the early-to-mid 2010s. Jay Kreps's 2014 critique gave us kappa architecture — the stream-as-source-of-truth alternative — and the lambda vs kappa debate dominated platform-architecture interviews for the next eight years.

The 2026 reality is that both are increasingly legacy for new builds. Iceberg and Delta have collapsed storage into a single ACID lakehouse layer; Flink, Materialize, and RisingWave have collapsed the query surface into Streaming SQL; and the dual-code tax that justified Lambda in 2011 is paid by a shrinking minority of teams. This guide walks through both architectures from first principles — the batch and streaming architecture they each propose, the streaming reprocessing patterns they imply, the lakehouse architecture that replaces them, and the unified batch streaming SQL surface that makes the decision rule "neither, use a Lakehouse" for most new pipelines. Each section pairs a teaching block with a Solution-Tail engineering answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

PipeCode blog header for Lambda vs Kappa Architecture in 2026 — bold white headline 'Lambda vs Kappa' with subtitle 'Batch · Speed · Serving · Reprocessing' and a split-scene diagram showing a dual-track Lambda pipeline on the left and a single-stream Kappa pipeline on the right, on a dark gradient with purple, green, orange, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the streaming practice library →, rehearse on event processing problems →, and stack the architecture muscles with real-time analytics drills →.


On this page


1. Why Lambda architecture was invented in 2011

Lambda was Nathan Marz's CAP-aware compromise for a world that did not yet have Kafka or Iceberg

The one-sentence invariant: Lambda architecture says "compute every metric twice — once in a slow, authoritative batch layer and once in a fast, approximate speed layer — then merge them in a serving layer at query time". Once you internalise that "batch is the source of truth, speed is the latency patch," the entire family of lambda architecture design questions becomes a sequence of "where does this metric live, and who recomputes it?" deductions.

The three properties Lambda solves.

  • Fault-tolerance. The batch layer is built on an immutable, append-only master dataset stored on HDFS or S3. If a stream-processing bug corrupts the speed layer's output, the batch layer's next run overwrites the bad numbers without manual intervention. The system is self-healing by construction.
  • Low-latency reads. Without a speed layer, the freshest data is whatever the batch layer last computed — typically 1 to 24 hours stale. The speed layer covers the gap between "last batch run" and "now" with approximate computations.
  • Batch-grade accuracy. Batch jobs see the entire dataset, can re-sort, can re-window, can use expensive operators that streaming engines avoid. They produce the "real" answer; the speed layer is only ever a stop-gap until batch catches up.

Three places Lambda was the right tradeoff in 2011.

  • The pre-Kafka world. Kafka 0.7 was the first public release in 2011 and did not yet have replication, compaction, or exactly-once semantics. Treating a Kafka cluster as the immutable master dataset was operationally risky. HDFS was the safer authoritative store, and Storm was the streaming engine of choice for the speed layer.
  • Storage was expensive. Keeping a year of events on HDFS plus the same year on Kafka would have doubled storage costs. Lambda kept one authoritative copy (HDFS) and a short retention window on Kafka.
  • Streaming engines were immature. Storm offered at-least-once semantics; exactly-once was years away. Trusting the streaming layer with authoritative numbers would have meant trusting at-least-once with money — most teams refused.

What interviewers listen for on lambda architecture probes.

  • Do you mention Nathan Marz and the year 2011 when asked who invented Lambda? — senior signal.
  • Do you name the three layers (batch, speed, serving) correctly without prompting? — required answer.
  • Do you call out the immutable master dataset as the load-bearing invariant, not just "HDFS"? — senior signal.
  • Do you say "Lambda is mostly legacy for new builds in 2026" without sounding dogmatic? — staff signal.

The 2026 reality check.

  • Kafka now has tiered storage, transactions, and exactly-once semantics — most of Lambda's 2011 "Kafka is too risky" rationale is gone.
  • Iceberg and Delta give us ACID tables on S3 — the immutable master dataset is now first-class storage, not "an HDFS directory we promise not to mutate."
  • Flink, Spark Structured Streaming, and Beam offer exactly-once across the full pipeline — the speed layer is no longer "approximate."
  • Most teams now have all three pieces, which is why Lambda's tradeoffs no longer pay for themselves on new builds.

Worked example — what Lambda actually buys you in 2011

Detailed explanation. A small clickstream team at a startup wants the last 30 days of "events per user per hour" for an analytics dashboard. In 2011, the obvious "pure batch" option (Hive query that runs every 6 hours) would mean the dashboard lags by up to 6 hours; the "pure stream" option (Storm topology) would mean any Storm bug or restart silently corrupts the metrics. Lambda gives both — the batch layer recomputes the canonical numbers every 6 hours, and the speed layer fills the trailing 6-hour window.

Question. Sketch the data flow for a Lambda implementation of "events per user per hour" — what runs in batch, what runs in speed, and how does the serving layer merge them?

Input — events arriving on a Kafka topic.

event_id user_id event_ts
1 u1 2026-06-10 09:14:00
2 u2 2026-06-10 09:42:00
3 u1 2026-06-10 10:01:00
4 u1 2026-06-10 14:33:00

Code (the three layers as pseudo-config).

# Batch layer — Spark job, runs every 6h, recomputes from HDFS
batch_job:
  input: hdfs://master_dataset/events/{YYYY}/{MM}/{DD}/
  output: hdfs://batch_views/events_per_user_per_hour/
  query: |
    SELECT user_id, date_trunc('hour', event_ts) AS bucket, COUNT(*) AS c
    FROM events
    GROUP BY user_id, bucket
  schedule: "0 */6 * * *"

# Speed layer — Storm topology, runs continuously, covers last 6h
speed_topology:
  input: kafka://events
  output: hbase://speed_views/events_per_user_per_hour/
  semantics: at-least-once
  state_ttl: 7h  # slightly longer than batch interval

# Serving layer — HBase query
serving_query: |
  SELECT user_id, bucket,
         COALESCE(batch.c, 0) + COALESCE(speed.c, 0) AS events_per_hour
  FROM batch_views batch
  FULL OUTER JOIN speed_views speed USING (user_id, bucket)
  WHERE bucket >= CURRENT_DATE - INTERVAL '30 days';
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Every event lands on the Kafka topic and is also written to HDFS (a dual-write — the original Lambda design assumed this). The HDFS copy becomes the immutable master dataset.
  2. The batch Spark job reads HDFS every 6 hours, recomputes the 30-day aggregate from scratch, and writes the result to a batch view directory.
  3. The speed Storm topology subscribes to the same Kafka topic, maintains a rolling 6-hour aggregate per user-hour in HBase, and updates it on every event.
  4. The serving query FULL OUTER JOINs the batch and speed views on (user_id, bucket). For hours older than 6h, the batch view dominates (speed has expired the state). For the most recent hour, only the speed view is present.
  5. When the batch job's next run completes, the merged answer for the trailing hours snaps to the batch number — silently overwriting any drift from the speed layer.

Output (querying the dashboard at 14:35 on 2026-06-10).

user_id bucket batch.c speed.c events_per_hour
u1 2026-06-10 09:00 1 NULL 1
u1 2026-06-10 10:00 1 NULL 1
u2 2026-06-10 09:00 1 NULL 1
u1 2026-06-10 14:00 NULL 1 1

Rule of thumb. Lambda buys you a self-healing system at the cost of writing every metric twice and operating two engines. In 2011, with immature streaming and expensive Kafka, that cost was acceptable for production analytics. In 2026, with mature Streaming SQL on Iceberg / Delta, it almost never is.

Worked example — why "pure stream" was not viable in 2011

Detailed explanation. The Storm topology that powers the speed layer is at-least-once: every event is processed at least once, but on failure / restart a few events may be processed twice. For a counter, that means the value can drift upward by 1-5% over a long-running deployment. In a "pure stream" world without a batch layer to self-heal, that drift becomes permanent — and finance / reporting teams cannot accept that.

Question. Show a small Storm bolt that double-counts on restart and demonstrate the divergence after 1000 events with one simulated restart.

Input — events on a Kafka topic.

event_id user_id
1..500 u1
501..1000 u1

Code (pseudo-Storm bolt — at-least-once counter).

class CountBolt:
    def __init__(self):
        self.counts = {}  # user_id -> count
        self.last_acked_offset = 0

    def execute(self, tuple):
        user_id = tuple["user_id"]
        offset = tuple["kafka_offset"]
        self.counts[user_id] = self.counts.get(user_id, 0) + 1
        # at-least-once: ack AFTER the side-effect; if process dies here,
        # the offset is NOT acked and the event is re-delivered.
        self.last_acked_offset = offset
        self.ack(tuple)

    def restore_from_kafka_offset(self):
        # On restart, Storm replays from last_acked_offset + 1.
        # But self.counts is in-memory only — it starts at 0 on restart.
        # So events between checkpoint and crash are RE-COUNTED.
        pass
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The bolt processes events 1..500 for user u1 and emits counts[u1] = 500. The Storm process crashes at offset 500 before checkpointing the in-memory counts map.
  2. Storm restarts. It re-reads from the last committed Kafka offset — say offset 400 (the most recent checkpoint was 100 events ago).
  3. The bolt processes events 400..500 again, incrementing counts[u1] by 100. The bolt then continues with events 501..1000.
  4. Final value: counts[u1] = 100 (re-counted) + 600 (clean) = 700. The correct value is 1000. The bolt is wrong by 30% because of one restart.
  5. In a true Kappa world we would fix this with exactly-once semantics (Flink + Kafka transactions). In 2011, the only safe fallback was to let a nightly batch job overwrite the wrong number — which is exactly what Lambda's batch layer does.

Output (counts after 1000 real events with one simulated restart).

layer reported count actual events drift
Speed (Storm at-least-once) 700 1000 -30%
Batch (Spark over HDFS) 1000 1000 0%
Lambda merged after next batch 1000 1000 0% (batch overwrites)

Rule of thumb. The original Lambda invariant is "the speed layer may be wrong; the batch layer is right; the serving layer trusts batch the moment batch is fresh enough." If your streaming engine is exactly-once (Flink + Kafka transactions in 2026), that invariant no longer pays for itself.

Data engineering interview question on Lambda's three layers

A senior interviewer often opens with: "Walk me through why anyone would build a Lambda architecture in 2011, and why most teams have moved off it. Be specific about which 2011 constraints have changed by 2026 and which still apply." It probes whether the candidate has internalised the historical context, not just memorised the textbook diagram.

Solution Using a layer-by-layer audit of the 2011-vs-2026 tradeoff

# A small decision matrix — each row is a layer constraint.
# Mark whether 2011 forced the Lambda answer and whether 2026 still does.
decision_matrix = [
    # (constraint, 2011_forces_lambda, 2026_forces_lambda, why_2026_differs)
    ("Kafka exactly-once",        True,  False,
        "Kafka transactions (KIP-98) ship in 2017; Flink + Kafka are EOS."),
    ("Immutable master dataset",  True,  False,
        "Iceberg / Delta provide ACID tables on S3 with time travel."),
    ("Streaming engine maturity", True,  False,
        "Flink + Spark Structured Streaming are production-grade."),
    ("Batch-grade reprocessing",  True,  False,
        "Replay from Kafka + Iceberg time travel covers reprocessing."),
    ("Regulated batch jobs",      True,  True,
        "Some industries still require a separate authoritative batch run."),
    ("Two-team org structure",    True,  True,
        "If batch and stream live in different teams, Lambda mirrors org."),
]

def lambda_recommendation(year: int) -> dict:
    forcing = [r for r in decision_matrix if (r[1] if year == 2011 else r[2])]
    return {
        "year": year,
        "forcing_constraints": len(forcing),
        "verdict": "use Lambda" if len(forcing) >= 3 else "prefer Lakehouse + Streaming SQL",
        "constraints": [r[0] for r in forcing],
    }

print(lambda_recommendation(2011))
print(lambda_recommendation(2026))
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

year forcing constraints found verdict
2011 6 (all rows TRUE) use Lambda
2026 2 (regulated batch + org structure) prefer Lakehouse + Streaming SQL

The trace highlights that four of the six constraints that forced Lambda in 2011 have been eliminated by 2026 technology — Kafka EOS, Iceberg / Delta, mature streaming engines, and Kafka-based replay. The two that remain are organisational, not technical.

Output:

year forcing verdict
2011 6 use Lambda
2026 2 prefer Lakehouse + Streaming SQL

Why this works — concept by concept:

  • Constraint-by-constraint audit — instead of arguing "Lambda vs Kappa" in the abstract, list every constraint Lambda was meant to solve and ask which still hold in 2026. Four of six have been retired by Kafka transactions, Iceberg / Delta, mature streaming engines, and replay-friendly retention.
  • Immutable master dataset — was the load-bearing assumption in 2011 (HDFS write-once). Iceberg and Delta give us the same property as first-class ACID tables on S3 with time-travel, eliminating the need for a separate "raw HDFS" layer.
  • Exactly-once semantics — Kafka KIP-98 (2017) plus Flink's checkpoint + two-phase-commit sink let us trust the streaming layer with authoritative numbers. The "speed is approximate" invariant is no longer a forced choice.
  • Org-structure forcing — even in 2026, if a regulated batch team owns the authoritative numbers and a separate streaming team owns dashboards, Lambda survives as a Conway's Law artefact, not a technical necessity.
  • Cost — the audit is a 30-minute architecture exercise; the migration off Lambda is a multi-quarter program. The audit's value is naming the work, not doing it. Big-O: O(constraints) — trivial compute.

SQL
Topic — streaming
Streaming architecture problems (SQL)

Practice →


2. Anatomy of Lambda — batch + speed + serving + the duplicate-code tax

Lambda's three layers are easy to draw and brutal to operate — the cost is the duplicate-code tax

The mental model in one line: batch layer = authoritative, hours behind; speed layer = approximate, seconds behind; serving layer = merges both at query time so the client sees one answer. Once you can sketch the three-layer cake in 30 seconds, the second-order questions — duplicate code, semantic drift, integration testing across engines — fall out naturally.

Visual three-layer-cake diagram of Lambda architecture — bottom batch layer (HDFS/S3 + Spark/MapReduce computing batch views), middle speed layer (Storm/Flink computing real-time views over the last few hours), top serving layer (HBase/Cassandra/Druid) with a merge query union-all'ing batch and speed views; a small annotation about the merge cutoff timestamp; on a light PipeCode card.

The three layers in detail.

  • Batch layer. An immutable append-only master dataset on HDFS or S3. Periodic batch jobs (Spark, MapReduce, Hive) read the entire dataset and recompute the batch views from scratch. The schedule is usually every 1 to 24 hours. The output is canonical.
  • Speed layer. A streaming engine (Storm, Flink, Spark Streaming, Samza) subscribes to the same event source and maintains speed views covering only the window since the last batch run completed. Output is approximate (at-least-once semantics in the early days; exactly-once in modern Flink). The state TTL is sized to slightly overlap the batch schedule.
  • Serving layer. A low-latency database (HBase, Cassandra, Druid, ClickHouse, ElasticSearch) that the client queries. The query is some form of batch_view UNION ALL speed_view with a cutoff timestamp deciding which layer's number wins for a given time window.

The merge query — three variants.

  • Union-all with cutoff. SELECT * FROM batch WHERE bucket < batch_max_ts UNION ALL SELECT * FROM speed WHERE bucket >= batch_max_ts. Simple, but you need to know batch_max_ts — usually written into a sentinel row by the batch job.
  • Full outer join + COALESCE. SELECT bucket, COALESCE(batch.c, speed.c) FROM batch FULL OUTER JOIN speed USING (bucket). Lets batch "overwrite" speed for any bucket they both contain, which is the canonical Lambda invariant.
  • Add the two. batch.c + speed.c works only if the two layers are guaranteed to be disjoint by bucket. Drift in the cutoff timestamp causes double-counting. Avoid in production.

The duplicate-code tax.

  • Two codebases. Spark in Python or Scala for batch; Flink or Storm in Java or Scala for speed. Even when both can be written in Scala, the streaming and batch APIs are not the same — windowing, watermarking, late-event handling all differ.
  • Two infra footprints. A YARN / EMR cluster for batch; a Flink / Storm cluster for speed. Two on-call rotations, two deployment pipelines, two monitoring stacks.
  • Semantic drift. Subtle differences in null handling, rounding, time-zone conversion, or join semantics produce different numbers across the two engines. Every Lambda team eventually files a "why does Spark say 1042 and Flink say 1039?" ticket.
  • Integration testing across engines. End-to-end tests have to seed both layers, exercise the merge query, and assert the combined answer matches an oracle. The matrix of failure modes scales as batch_bug × speed_bug × merge_bug.

Where Lambda still legitimately wins in 2026.

  • Regulated batch. If a regulator (financial, healthcare, pharma) requires that the authoritative numbers be produced by a separate, auditable batch job — Lambda's batch layer is the perfect home for that requirement.
  • Legacy frozen pipelines. A batch job that has run nightly for 8 years and survived 14 platform migrations should not be rewritten just to align with a Streaming SQL fashion. Wrap it; leave it; budget its retirement for later.
  • Tiny streaming teams. If you have one DE who can build Spark batch jobs but not Flink streaming jobs, hiring a stream engineer to migrate to Kappa may cost more than running the dual stack.

Worked example — the merge query, written three ways

Detailed explanation. Three teams inherit the same Lambda pipeline. Each writes the serving-layer merge query slightly differently — UNION ALL with cutoff, FULL OUTER JOIN with COALESCE, and naive addition. Two of them are correct; one of them double-counts on the cutoff boundary. Spotting which one is the interviewer's favourite probe.

Question. Given batch view (computed at 14:00 covering all hours up to 14:00) and speed view (covering 14:00 onward), write the three merge variants and show which one double-counts the 14:00 hour.

Input — batch_views (computed at 14:00 covering data through 13:59).

user_id bucket c
u1 2026-06-10 12:00 5
u1 2026-06-10 13:00 3

Input — speed_views (covering data since 13:00 due to a generous overlap).

user_id bucket c
u1 2026-06-10 13:00 3
u1 2026-06-10 14:00 2

Code.

-- Variant 1 — UNION ALL with explicit cutoff
-- Cutoff = batch_max_ts = '2026-06-10 14:00'
SELECT user_id, bucket, c FROM batch_views WHERE bucket < '2026-06-10 14:00'
UNION ALL
SELECT user_id, bucket, c FROM speed_views WHERE bucket >= '2026-06-10 14:00';

-- Variant 2 — FULL OUTER JOIN + COALESCE (batch wins ties)
SELECT
    COALESCE(b.user_id, s.user_id)  AS user_id,
    COALESCE(b.bucket,  s.bucket)   AS bucket,
    COALESCE(b.c,       s.c)        AS c
FROM batch_views b
FULL OUTER JOIN speed_views s USING (user_id, bucket);

-- Variant 3 — naive addition (BROKEN — double-counts the 13:00 row)
SELECT
    COALESCE(b.user_id, s.user_id)  AS user_id,
    COALESCE(b.bucket,  s.bucket)   AS bucket,
    COALESCE(b.c, 0) + COALESCE(s.c, 0) AS c
FROM batch_views b
FULL OUTER JOIN speed_views s USING (user_id, bucket);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Variant 1 (UNION ALL with cutoff) emits batch rows for buckets < 14:00 and speed rows for buckets >= 14:00. The 13:00 bucket exists in both views — but the cutoff says "batch wins for < 14:00" so only the batch row is emitted. Correct.
  2. Variant 2 (FULL OUTER JOIN + COALESCE) takes the batch number when both views have a row for the same bucket. For the 13:00 bucket, both have c=3; COALESCE returns 3. For the 14:00 bucket, only speed has data; COALESCE returns 2. Correct.
  3. Variant 3 (naive addition) adds the two views row-by-row. For the 13:00 bucket: 3 (batch) + 3 (speed) = 6. Double counted. The bug shows up only on the overlap window — and only after the batch view catches up to the speed view.
  4. The bug in Variant 3 is silent: the dashboard reports 6 for an hour that genuinely had 3 events. No exception, no warning, no log. This is the most-common Lambda data-quality bug.

Output (each variant's c for bucket 2026-06-10 13:00).

Variant bucket 13:00 c correct?
UNION ALL with cutoff 3 yes
FULL OUTER JOIN + COALESCE 3 yes
naive addition 6 NO — double count

Rule of thumb. Either pick a clean cutoff timestamp and UNION-ALL (Variant 1), or let batch overwrite speed via COALESCE (Variant 2). Never add the two views. If you must add, materialise the cutoff window into a single "pending" view that one engine owns, not both.

Worked example — the duplicate-code tax in practice

Detailed explanation. A team owns one business metric — "average revenue per active user in the last 7 days." They wrote it in Spark for the batch layer and in Flink for the speed layer. After three months, the two numbers drift by 0.3%. The bug hunt reveals four sources of drift: watermarking, null handling, rounding, and join semantics.

Question. Show the Spark batch implementation and the Flink streaming implementation of the same metric, and identify each source of drift.

Input. A transactions event stream with columns (user_id, amount, event_ts).

Code.

Visual diagram of the Lambda duplicate-code tax — same business logic implemented twice (left Spark batch job in Scala/Python, right Flink streaming job in Java) with diverging results visualised; a centre divergence arrow showing batch_total = 1042 vs speed_total = 1039 because of subtle semantic drift; on a light PipeCode card.

# Batch layer — PySpark, runs every 4h on the master dataset
from pyspark.sql import functions as F

batch_df = (
    spark.read.parquet("s3://master/transactions/")
    .where(F.col("event_ts") >= F.expr("current_timestamp() - INTERVAL 7 DAYS"))
    .groupBy("user_id")
    .agg(F.sum("amount").alias("total"), F.count("*").alias("txns"))
    .filter(F.col("txns") > 0)
    .agg(F.avg("total").alias("avg_revenue_per_active_user"))
)
Enter fullscreen mode Exit fullscreen mode
// Speed layer — Flink, runs continuously
DataStream<Transaction> txns = env.fromSource(kafkaSource, ...);
txns
    .keyBy(t -> t.userId)
    .window(SlidingEventTimeWindows.of(Time.days(7), Time.hours(1)))
    .aggregate(new SumAndCount())  // returns (sum, count)
    .filter(r -> r.count > 0)
    .windowAll(SlidingEventTimeWindows.of(Time.hours(1)))
    .aggregate(new AvgOfSums());   // returns avg
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Watermarking drift. The batch job filters by event_ts >= now - 7 days in processing time. The Flink job uses event-time sliding windows. A transaction with event_ts = now - 7d 1h arrives 30 seconds before the batch job runs — batch excludes it; Flink (with a 1-hour allowed lateness) includes it.
  2. Null handling drift. Spark's F.sum("amount") returns NULL when every row in the group is NULL; Flink's SumAndCount aggregator initialised to 0 returns 0. The downstream filter(txns > 0) behaves differently on the two — Spark drops NULL totals; Flink keeps zero totals.
  3. Rounding drift. Spark uses DECIMAL(18, 2) by default; Flink's Double is binary floating point. Summing 100,000 transactions of $0.10 each gives Spark exactly $10,000.00 and Flink $9999.999...something — a 0.0001% drift that compounds across users.
  4. Join semantics drift. Even though this metric has no explicit join, the anti-join with inactive users is implicit in filter(txns > 0). Spark's catalyst optimiser sometimes pushes the filter below the aggregate; Flink does it after. The two execution orders produce the same answer in theory but different intermediate states under failure / restart.

Output (after 90 days of drift).

layer reported avg actual drift
Batch (Spark, every 4h) $42.13 $42.10 +0.07%
Speed (Flink, continuous) $42.27 $42.10 +0.40%
Lambda merged $42.13 (batch wins) $42.10 +0.07%

Rule of thumb. The duplicate-code tax is not "the code is twice as long" — it is "every drift between engines becomes a permanent line item on the on-call runbook." Audit semantic drift before audit number drift; the four classical sources are watermarking, null handling, rounding, and join semantics.

Worked example — testing the merge query in CI

Detailed explanation. A clean test for the serving-layer merge query seeds both batch and speed views with known inputs, runs the merge, and asserts the result matches the oracle. The trick is making the test deterministic across Spark and Flink — which usually means freezing the wall clock and event time.

Question. Write a parameterised pytest that seeds (batch_rows, speed_rows, expected_merged_rows) and asserts the merge query is correct under three scenarios: pure-batch, pure-speed, and overlap.

Input — three test cases.

case batch_rows speed_rows expected merged
pure-batch 5 hours of data 0 5
pure-speed 0 2 hours of data 2
overlap 5 hours of data last 1h overlaps 5

Code.

import pytest

def merge_batch_speed(batch_rows, speed_rows):
    """COALESCE-style merge: batch wins ties, speed fills gaps."""
    merged = {}
    for r in batch_rows:
        merged[r["bucket"]] = r["c"]
    for r in speed_rows:
        merged.setdefault(r["bucket"], r["c"])
    return [{"bucket": b, "c": c} for b, c in sorted(merged.items())]

@pytest.mark.parametrize("batch,speed,expected", [
    # Case 1 — pure batch
    ([{"bucket": "09:00", "c": 5}], [], [{"bucket": "09:00", "c": 5}]),
    # Case 2 — pure speed
    ([], [{"bucket": "14:00", "c": 2}], [{"bucket": "14:00", "c": 2}]),
    # Case 3 — overlap; batch should win the 13:00 row
    (
        [{"bucket": "13:00", "c": 3}],
        [{"bucket": "13:00", "c": 99}, {"bucket": "14:00", "c": 2}],
        [{"bucket": "13:00", "c": 3}, {"bucket": "14:00", "c": 2}],
    ),
])
def test_merge(batch, speed, expected):
    assert merge_batch_speed(batch, speed) == expected
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The merge helper implements the COALESCE-style rule: batch wins ties. setdefault does not overwrite existing keys, so a speed row for a bucket already in batch is ignored.
  2. Case 1 (pure batch) emits the batch row as-is. Case 2 (pure speed) emits the speed row as-is. Case 3 (overlap) emits the batch row for 13:00 (winning the tie) and the speed row for 14:00 (no batch row exists).
  3. The test fixes deterministic inputs — no timestamps that depend on now(), no random user_ids. Running the same suite in CI gives identical results every time.
  4. To extend to "real" Spark + Flink: replace merge_batch_speed with a SQL query against two seeded tables in a DuckDB or Trino test harness. The shape of the assertion stays the same.

Output (pytest run).

case actual expected pass?
pure-batch [{bucket: 09:00, c: 5}] same yes
pure-speed [{bucket: 14:00, c: 2}] same yes
overlap [{13:00: 3, 14:00: 2}] same yes

Rule of thumb. Every Lambda pipeline should ship with a parameterised merge-query test that covers pure-batch, pure-speed, and overlap. The cost is one afternoon; the benefit is catching the "naive addition" bug at PR time instead of at the dashboard-discrepancy ticket three quarters later.

Data engineering interview question on Lambda's duplicate-code tax

A senior interviewer might frame this as: "Your team owns a Lambda pipeline that has accumulated four metrics where Spark and Flink disagree by a small percentage. You have one quarter to either (a) reconcile both engines or (b) migrate off Lambda. Defend your choice." It probes engineering economics, not architecture trivia.

Solution Using a cost-vs-value decomposition

def lambda_remediation_cost(metrics: int, engineers: int, weeks: int) -> dict:
    """
    Estimate the cost of two strategies given a fixed quarter:
      a) reconcile_both  — fix drift in place; keep dual stack
      b) migrate_off     — move to Lakehouse + Streaming SQL
    """
    # Strategy A — reconcile drift, keep dual stack
    drift_audit_cost = metrics * 2          # eng-weeks per metric
    drift_fix_cost   = metrics * 3
    integration_test = metrics * 1
    dual_ongoing     = 4 * 4                # 4 weeks/qtr forever
    total_a = drift_audit_cost + drift_fix_cost + integration_test + dual_ongoing

    # Strategy B — migrate to Lakehouse + Streaming SQL once
    migration       = metrics * 4           # rewrite once in Streaming SQL
    cutover_risk    = metrics * 1           # parallel-run buffer
    legacy_decom    = 2
    total_b = migration + cutover_risk + legacy_decom

    cap = engineers * weeks
    return {
        "capacity_eng_weeks": cap,
        "strategy_a_cost":    total_a,
        "strategy_b_cost":    total_b,
        "recommend":          "migrate" if total_b < total_a else "reconcile",
    }

print(lambda_remediation_cost(metrics=4, engineers=3, weeks=12))
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step value
capacity (3 eng × 12 wk) 36 eng-weeks
strategy A — reconcile drift 2*4 + 3*4 + 1*4 + 16 = 40 eng-weeks
strategy B — migrate 4*4 + 1*4 + 2 = 22 eng-weeks
recommendation migrate (B is 18 eng-weeks cheaper than A)
recurring savings (B vs A) 16 eng-weeks/quarter saved forever

The decomposition turns an architecture debate into an arithmetic question. Strategy A also leaves the next set of drift bugs on the runbook; Strategy B amortises the work once.

Output:

strategy one-time cost recurring cost/qtr total over 1 yr
reconcile (A) 24 eng-wk 16 eng-wk/qtr 88 eng-wk
migrate (B) 22 eng-wk 0 (dual stack gone) 22 eng-wk

Why this works — concept by concept:

  • Decompose the cost into one-time + recurring — every Lambda remediation has both. Reconciling drift is one-time; running two engines is recurring. The recurring component dominates over a 1-year horizon.
  • Audit before fix — the drift_audit_cost line item is the cheap insurance: cataloguing every drift source before touching code stops the "fix the symptom, miss the root cause" loop that drift bugs love.
  • Parallel-run for cutover — Strategy B's cutover_risk budget covers running the new Lakehouse stack alongside the legacy stack for one batch cycle to compare numbers. This is the only safe way to retire a Lambda pipeline.
  • Conway's-Law sanity check — if the batch and stream teams are separate orgs, the migration is also an org-change project, not just a tech project. Add 25-50% to the migration estimate.
  • Cost — the decomposition itself is O(metrics); the migration is O(metrics × LOC); the savings are linear in the number of dual-implemented metrics. Big-O: O(N) per quarter, where N is the metric count.

SQL
Topic — event processing
Event processing problems (SQL)

Practice →


3. Kappa unified model — stream as source of truth

Kappa drops the batch layer and bets that Kafka + Flink can serve as both the log and the compute engine

The mental model in one line: Kappa says "Kafka is the immutable log; one stream processor (Flink, Kafka Streams, Samza) computes every view; reprocessing is just replaying the stream from an older offset". Once you internalise "the log is the master dataset," the entire kappa architecture family of interview probes — reprocessing, retention, replay throughput, exactly-once — collapses into a single mental model.

Visual diagram of Kappa architecture — Kafka as immutable log on the left feeding a single stream processor (Flink/Kafka Streams/Samza) in the middle, which writes to a serving layer on the right; below it a 'reprocessing by replay' band showing the same stream processor reading from an older Kafka offset to recompute history; on a light PipeCode card.

Jay Kreps's 2014 proposal in one paragraph. Kreps argued that Lambda's duplicate-code tax was an artefact of immature streaming tools, not a fundamental architectural requirement. If Kafka could be operated with long retention and the streaming engine offered exactly-once, the batch layer was redundant — every "batch" computation could be re-expressed as a stream replayed from offset 0. The result was Kappa: one log, one engine, one codebase.

The four invariants of Kappa.

  • Kafka is the master dataset. Topics are configured with infinite (or very long) retention. Every event is persisted, replayable, and immutable. Compacted topics extend the model to slowly-changing dimensions.
  • One stream processor. Flink, Kafka Streams, or Samza reads the topic and computes the views. There is no separate batch engine — by construction.
  • Reprocessing is replay. Need to fix a bug in the windowing logic? Deploy the new code, point it at an older Kafka offset, and recompute the views. There is no "batch overwrite"; there is only "stream replay."
  • The serving layer holds materialised views. The output of the stream processor is written to a low-latency view store (RocksDB-backed state, Cassandra, Druid, Elasticsearch, or — increasingly — a materialised view in Materialize or RisingWave).

Where Kappa wins.

  • Append-mostly event streams. Clickstream, application logs, IoT telemetry — workloads where every event is a fact, never a mutation. Kafka is a perfect home for these.
  • Sub-second latency requirements. A dashboard that must reflect events within 1 second of arrival cannot wait for a batch run. Flink's exactly-once incremental processing is the only viable engine.
  • ML feature pipelines. Real-time features (last-5-minute click count, sliding-window CTR) are naturally streaming computations. A batch layer would add 24-hour latency for no benefit.
  • One small team. A single engineering team that owns one engine, one codebase, and one mental model can move faster than a Lambda team with two of everything.

Where Kappa hurts.

  • Heavy historical backfills. Replaying 5 years of events from Kafka takes hours of throughput; a batch job scanning S3 directly would finish in 15 minutes. Kafka's replay is bottlenecked on partition count and consumer parallelism.
  • Complex joins to slowly-changing dimensions. Joining a high-volume event stream against a 50-million-row dimension table is awkward in Flink (state size explodes). Spark on S3 handles it natively.
  • Regulatory reprocessing windows. "Recompute Q4 2024 with the new tax rule" implies a full historical run. If the original Kafka topic has been retention-evicted, Kappa cannot do this without an external archive — at which point you have re-invented Lambda's batch layer.
  • Long retention costs. Kafka tiered storage on S3 closes the cost gap, but for some workloads the storage bill is still 2-5× a pure-batch Parquet archive.

Common interview probes on kappa architecture.

  • "What is the difference between Lambda and Kappa?" — Lambda has a separate batch layer; Kappa uses the stream as the source of truth.
  • "How does Kappa handle reprocessing?" — by replaying from an older Kafka offset, optionally to a new output topic / view (the "side-deploy" pattern).
  • "What is the side-deploy pattern?" — deploy v2 of the stream job alongside v1, point v2 at an older offset, switch reads to v2's view when caught up.
  • "When would you not pick Kappa?" — heavy historical backfills, complex SCD joins, regulated batch jobs, or org structures where streaming expertise is concentrated in one team.

Worked example — sliding-window CTR in Flink (Kappa style)

Detailed explanation. A clickstream pipeline computes "click-through rate per ad over the last 5 minutes." In a Kappa world, this is a single Flink job reading impressions and clicks from Kafka, keyed by ad_id, with a sliding event-time window. The output is written to a materialised view; the dashboard queries the view directly.

Question. Write the Flink job and show the state size, output rate, and how reprocessing the same job from offset 0 would behave.

Input — two Kafka topics.

topic fields
impressions ad_id, event_ts
clicks ad_id, event_ts

Code (Flink SQL).

-- Define the source tables on Kafka topics
CREATE TABLE impressions (
    ad_id STRING,
    event_ts TIMESTAMP(3),
    WATERMARK FOR event_ts AS event_ts - INTERVAL '10' SECOND
) WITH ('connector' = 'kafka', 'topic' = 'impressions', ...);

CREATE TABLE clicks (
    ad_id STRING,
    event_ts TIMESTAMP(3),
    WATERMARK FOR event_ts AS event_ts - INTERVAL '10' SECOND
) WITH ('connector' = 'kafka', 'topic' = 'clicks', ...);

-- The CTR computation as a single Flink SQL statement
CREATE TABLE ctr_5m AS
SELECT
    ad_id,
    window_start,
    window_end,
    CAST(c.click_count AS DOUBLE) / NULLIF(i.impression_count, 0) AS ctr,
    i.impression_count,
    c.click_count
FROM (
    SELECT ad_id, window_start, window_end, COUNT(*) AS impression_count
    FROM TABLE(HOP(TABLE impressions, DESCRIPTOR(event_ts),
                   INTERVAL '1' MINUTE, INTERVAL '5' MINUTE))
    GROUP BY ad_id, window_start, window_end
) i
LEFT JOIN (
    SELECT ad_id, window_start, window_end, COUNT(*) AS click_count
    FROM TABLE(HOP(TABLE clicks, DESCRIPTOR(event_ts),
                   INTERVAL '1' MINUTE, INTERVAL '5' MINUTE))
    GROUP BY ad_id, window_start, window_end
) c USING (ad_id, window_start, window_end);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The WATERMARK FOR event_ts AS event_ts - INTERVAL '10' SECOND declaration tells Flink that events may arrive up to 10 seconds late. Events arriving later than the watermark are dropped (or, with ALLOWED_LATENESS, side-output).
  2. The HOP window function creates 5-minute sliding windows that advance every 1 minute. Each impression contributes to 5 overlapping windows; same for clicks. State size per ad is O(unique_ads × 5).
  3. The impression and click subqueries each maintain a windowed COUNT in keyed state, partitioned by ad_id. Flink's checkpoint mechanism persists this state every N seconds; recovery restores it from the latest checkpoint.
  4. The outer SELECT divides clicks by impressions with NULLIF(impressions, 0) to avoid division by zero. The LEFT JOIN preserves windows with impressions but no clicks (CTR = 0 / non-zero = 0).
  5. To reprocess this job — say, the windowing was wrong — deploy a new instance pointed at the earliest Kafka offset, materialise to a separate ctr_5m_v2 view, and cut the dashboard over when v2 catches up to v1. This is the side-deploy reprocessing pattern.

Output (one ad after a single 5-minute window).

ad_id window_start window_end impression_count click_count ctr
ad_42 14:00 14:05 1000 30 0.030
ad_42 14:01 14:06 1200 36 0.030
ad_42 14:02 14:07 1100 28 0.025

Rule of thumb. Sliding-window event-time aggregations are Kappa's home turf — exactly the workload Lambda's speed layer was built for, but with exactly-once semantics so the result is canonical rather than approximate. Memorise the WATERMARK + HOP + LEFT JOIN + NULLIF shape; it appears in every streaming-architecture interview.

Worked example — the side-deploy reprocessing pattern

Detailed explanation. A bug in the CTR job double-counts clicks by 2%. Instead of "stopping the job and fixing in place" (which leaves a dirty output view for hours), Kreps's prescribed pattern is to deploy v2 alongside v1, replay from offset 0, and atomically switch reads to v2 when caught up. The dashboard never sees a wrong number; the dirty v1 view is just garbage-collected.

Question. Write a small orchestration script that runs v1 and v2 of the CTR job in parallel, monitors v2's lag against v1, and emits a "switch" event when v2 is within 30 seconds of v1.

Input — both jobs read the same impressions and clicks topics.

Code.

# Side-deploy orchestration
def consumer_lag(group: str, topic: str) -> int:
    """Returns end-of-topic offset minus committed offset for the group."""
    end = kafka_admin.get_end_offsets(topic).values_sum()
    committed = kafka_admin.get_committed_offsets(group, topic).values_sum()
    return end - committed

def wait_until_caught_up(v2_group: str, threshold_sec: int = 30) -> None:
    while True:
        lag_events = consumer_lag(v2_group, "impressions")
        # rough estimate: 1000 events/sec from observed throughput
        est_lag_sec = lag_events / 1000
        print(f"v2 lag = {est_lag_sec:.1f}s")
        if est_lag_sec < threshold_sec:
            return
        time.sleep(60)

def cutover():
    # 1) Deploy v2 with a fresh consumer group, reset to earliest
    deploy_flink_job("ctr_v2", offset="earliest", output_table="ctr_5m_v2")
    # 2) Wait until v2 has replayed history and is caught up
    wait_until_caught_up("ctr_v2")
    # 3) Atomically point the dashboard at the v2 view
    update_dashboard_view("ctr_5m", new_underlying="ctr_5m_v2")
    # 4) Drain v1 and reclaim its state
    stop_flink_job("ctr_v1")
    drop_table("ctr_5m")

cutover()
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Step 1 deploys v2 of the Flink job with a fresh consumer group ID and resets the offset to earliest. v2 immediately starts replaying every event from the beginning of Kafka's retention. v1 keeps running normally, writing to ctr_5m.
  2. Step 2 polls v2's consumer lag every 60 seconds. The lag drops as v2 catches up; the threshold 30s is the "close enough to live" point.
  3. Step 3 flips the dashboard's view — ctr_5m is just a SQL view that points at ctr_5m_v2. Readers see the new numbers atomically. The old ctr_5m_v1 is no longer read.
  4. Step 4 stops v1 and drops its physical table. The Kafka topics are untouched — they are the source of truth, the views are derived.
  5. Why no double-counting during cutover? Both jobs read the same Kafka topics, but each has its own consumer group, so neither blocks the other. The dashboard reads from one view at a time — the atomic flip in Step 3 ensures no overlap.

Output (orchestration log).

t (min) v1 status v2 status v2 lag (sec) dashboard reads
0 running deploying ctr_5m_v1
1 running replaying 86400 ctr_5m_v1
90 running replaying 1200 ctr_5m_v1
92 running caught up 28 ctr_5m_v2
93 stopped running < 30 ctr_5m_v2

Rule of thumb. The side-deploy pattern is the Kappa reprocessing answer in an interview. Memorise the four steps (deploy v2 from earliest, wait for catch-up, flip view, drain v1). It generalises beyond Flink to any stream processor with replayable input — Kafka Streams, Samza, Materialize, even Spark Structured Streaming.

Worked example — Kafka retention sizing for replayability

Detailed explanation. Kappa replay is only possible if Kafka still has the events you want to replay. "How long should we keep events?" is one of the first design questions on a Kappa project. The answer is a function of expected bug-discovery latency, regulatory retention, and storage cost.

Question. A clickstream topic does 100,000 events/sec average, peaks at 500,000 events/sec, with an average event size of 1 KB. Calculate retention sizes for 7 days, 30 days, 90 days, and 365 days. Recommend a retention given a $0.023 / GB / month S3 (tiered storage) cost and an expected bug-discovery window of 21 days.

Code (Python sizing).

EVENTS_PER_SEC_AVG = 100_000
EVENT_SIZE_BYTES   = 1024
S3_COST_PER_GB_MO  = 0.023

def retention_size_gb(days: int) -> float:
    seconds = days * 86_400
    bytes_total = seconds * EVENTS_PER_SEC_AVG * EVENT_SIZE_BYTES
    return bytes_total / (1024 ** 3)

def monthly_cost(days: int) -> float:
    return retention_size_gb(days) * S3_COST_PER_GB_MO

for d in [7, 30, 90, 365]:
    gb = retention_size_gb(d)
    cost = monthly_cost(d)
    print(f"{d:>4}d  size={gb:>9.0f} GB  cost=${cost:>9.0f}/mo")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Compute total bytes per day: 86400 s × 100000 ev/s × 1024 B = 8.85 TB/day. That is the average; peaks can push it 5× during traffic spikes, but average dominates retention sizing.
  2. For 7 days: 62 TB. For 30 days: 265 TB. For 90 days: 796 TB. For 365 days: 3.2 PB. Tiered S3 storage absorbs this; in-broker (local SSD) storage would not.
  3. Cost at $0.023 / GB / month: 7d → $1,460/mo, 30d → $6,250/mo, 90d → $18,800/mo, 365d → $76,000/mo. Multi-region replication doubles the cost.
  4. Bug-discovery window of 21 days is the lower bound on retention — if you discover a bug 21 days after deployment, you must be able to replay 21 days. Pick 30 days to give buffer.
  5. Regulatory retention may add a hard floor (1 year for finance, 7 years for healthcare). If so, replay from a separate archival store (Iceberg, Glacier) rather than from Kafka itself — Kafka replay throughput suffers above 60 days of retention.

Output (retention sizing table).

retention size (GB) monthly cost (S3 tiered)
7 days 60,479 $1,391
30 days 259,200 $5,962
90 days 777,600 $17,885
365 days 3,153,600 $72,533

Rule of thumb. Size Kafka retention at max(bug_discovery_window, regulatory_minimum) × 1.5. Everything older lives in Iceberg / Delta on cheap S3 — at which point you have re-invented the batch layer, but with one storage layer and one query language. That is the Lakehouse pattern in section 5.

Data engineering interview question on Kappa's reprocessing model

A senior interviewer often opens with: "Walk me through how you would re-run an entire year of CTR computation in Kappa after discovering a bug in the windowing logic. Specifically address Kafka retention, replay throughput, view cutover, and the dashboard SLA during the migration." It probes whether the candidate has built reprocessing in anger or only read about it.

Solution Using a side-deploy with offset reset and atomic view cutover

# End-to-end reprocessing plan for a 1-year window
def reprocess_year(job_name: str, output_table: str, ttl_days: int = 365):
    # 1) Check Kafka retention covers the window
    earliest = kafka_admin.earliest_timestamp("impressions")
    if (now() - earliest).days < ttl_days:
        # 1a) Fallback — replay from Iceberg archive instead
        return reprocess_from_iceberg("iceberg.events.impressions",
                                       lookback_days=ttl_days)

    # 2) Deploy v2 of the job with offset=earliest, new consumer group, new sink
    deploy_flink_job(
        job_name=f"{job_name}_v2",
        offset="earliest",
        output_table=f"{output_table}_v2",
        consumer_group=f"{job_name}_v2_grp",
    )

    # 3) Monitor v2 catch-up; emit periodic progress
    while True:
        lag = consumer_lag(f"{job_name}_v2_grp", "impressions")
        if lag < 1_000_000:  # within ~1 minute at 1M ev/s
            break
        log("v2 lag", events=lag, est_seconds=lag/1_000_000)
        sleep(60)

    # 4) Atomic view cutover — dashboard sees new numbers from this moment
    swap_view(view=output_table, points_at=f"{output_table}_v2")

    # 5) Cooldown — keep v1 running for 1 hour as fallback
    sleep(3600)
    stop_flink_job(f"{job_name}_v1")
    drop_table(f"{output_table}_v1")

reprocess_year("ctr_job", "ctr_5m", ttl_days=365)
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step what happens duration
1 check Kafka retention covers 365 days 1 minute
1a (if not) fall back to Iceberg archive 2 hours start-up
2 deploy v2 from earliest offset 5 minutes deploy
3 v2 replays 365 days of events at 5× live throughput 73 days / 5 = ~15 days, parallelised 30 ways → ~12 hours
4 atomic view swap; dashboard sees new numbers < 1 second
5 hour-long cooldown before draining v1 1 hour

The plan exposes the real cost of Kappa reprocessing: replay throughput. 365 days of events replayed at 5× live (assuming the new job is twice as fast as the old and runs on 2× the parallelism) takes ~12 hours of wall-clock time. The dashboard SLA never breaks because the cutover is atomic.

Output:

metric value
total wall-clock for reprocessing ~13 hours
dashboard downtime 0 seconds (atomic view swap)
v1 runs in parallel yes, until step 5
Kafka retention required 365 days OR Iceberg fallback
extra infra during reprocess 2× CPU / memory of the steady-state job

Why this works — concept by concept:

  • Side-deploy with fresh consumer group — v1 and v2 read the same Kafka topic without interfering because each has its own committed offset. The dashboard reads from exactly one view at a time, never both. This is the load-bearing invariant of Kappa reprocessing.
  • Offset reset to earliest — Kafka's auto.offset.reset = earliest plus a fresh consumer group makes "replay from the beginning" a one-line config, not a custom backfill job. Same code path as steady-state; same SQL; same semantics.
  • Atomic view swap — the dashboard queries a view, not the underlying table. Swapping the view's underlying table is an O(1) catalog operation in Iceberg, Snowflake, BigQuery, and most modern engines. No downtime.
  • Iceberg fallback for retention misses — if Kafka retention is shorter than the reprocess window, Iceberg / Delta archive is the only path. This is the moment a "pure Kappa" architecture starts looking like a Lakehouse.
  • Cooldown before drain — keeping v1 alive for an hour after cutover gives a rollback option if v2 emits a regression. Cheap insurance; cost is one hour of duplicated compute.
  • Cost — replay throughput is O(events × ratio_of_replay_to_live_throughput); typical ratio is 3-10×, parallelised by Kafka partition count. Big-O: O(N / partitions × replay_ratio).

SQL
Topic — real-time analytics
Real-time analytics problems (SQL)

Practice →


4. Reprocessing patterns — replay window, backfill, state restoration

Reprocessing in Kappa is not one pattern — it is four, and the right one depends on the size of the bug and the size of the history

The mental model in one line: reprocessing is "throw away the bad view, rebuild it from the log" — but the log boundary you replay from determines whether it takes 5 minutes or 5 days. Once you can name the four patterns — replay window, full backfill, bootstrap-and-tail, state restoration — the entire streaming reprocessing interview surface collapses to "which one does this scenario need?"

Pattern 1 — bounded replay window.

  • Shape. Reset the consumer group to a known offset (or timestamp), replay forward, stop when caught up. Usually used for "fix the last 7 days" or "re-emit the last 24 hours of late events."
  • Cost. O(window × event rate / partitions × replay throughput). Typically minutes to a few hours.
  • When to use. Bug fixes with a known impact window. Late-arriving events that exceeded the original watermark. Backfilling a new field for the most recent slice.

Pattern 2 — full backfill from earliest offset.

  • Shape. Deploy v2 with auto.offset.reset = earliest, side-deploy the output view, cut over when caught up. The textbook Kappa reprocessing pattern (section 3's worked example).
  • Cost. O(total Kafka retention × event rate / partitions × replay throughput). Can be days for multi-year retention.
  • When to use. Algorithm changes, schema migrations, anything that affects every record's interpretation. Requires Kafka retention >= the reprocess window — or an external archive.

Pattern 3 — bootstrap with batch + tail with stream.

  • Shape. Run a one-time batch job (Spark on Iceberg / Delta) to materialise the historical view; switch to streaming for the tail. The hybrid that looks like Lambda but with one storage layer.
  • Cost. O(history × batch throughput) for the one-time + O(tail rate) for steady state. Usually orders of magnitude cheaper than full replay.
  • When to use. Multi-year history where Kafka replay would be prohibitively expensive. Iceberg's batch read throughput is 10-50× Kafka replay throughput for the same data.

Pattern 4 — state restoration from checkpoint.

  • Shape. Flink / Kafka Streams persists keyed state to durable storage (RocksDB + S3). On crash or upgrade, the state is restored from the last successful checkpoint; the Kafka offset is rewound to the checkpoint boundary; processing resumes.
  • Cost. O(state size / restore throughput) — minutes for typical state sizes (GB-scale).
  • When to use. Operational reprocessing — engine restart, upgrade, scaling. Not for bug fixes; the state replays the old code's behaviour.

Dual-write and the outbox pattern.

  • The dual-write problem. A service writes to its own DB and also to Kafka. Either write can fail independently, leaving DB and Kafka divergent. Reprocessing assumes the Kafka log is canonical — but if the DB has rows Kafka doesn't, replay produces wrong numbers.
  • The outbox pattern. Writes go to a single DB transaction that touches both the business table and an outbox table. A separate CDC process (Debezium, Flink CDC) tails the DB log and emits to Kafka. Atomic by construction.
  • Why it matters for Kappa. Without the outbox, Kappa's "log is the source of truth" invariant is a polite lie. With the outbox, reprocessing genuinely replays the truth.

Exactly-once during replay.

  • Idempotent sinks. The output writer must be safe to re-execute. Upserts on a primary key are the canonical idempotent sink; append-only sinks (a downstream Kafka topic) need de-duplication on the consumer.
  • Transactional producers. Kafka 0.11+ supports producer-level transactions: the offset commit and the output produce happen atomically. Combined with Flink's two-phase commit sink, this gives end-to-end exactly-once.
  • Deterministic UDFs. Any user-defined function in the stream job must be deterministic — same input always yields same output. now(), random(), or external API calls break replay determinism.

Worked example — bounded replay for a 7-day bug

Detailed explanation. A clickstream pipeline has a bug: a CASE expression mis-bucketed mobile traffic for the last 7 days. The fix is a one-line code change. Instead of a full backfill (expensive), use a bounded replay: reset the consumer group to "7 days ago," redeploy v2, let it overwrite the affected partitions, drain v1.

Question. Write the offset-reset commands and orchestration for a bounded 7-day reprocess. Estimate wall-clock time given 100k events/sec and 24 Kafka partitions.

Code (orchestration).

SEVEN_DAYS_AGO_MS = int((now() - timedelta(days=7)).timestamp() * 1000)

# 1) Build a new consumer group at the 7-days-ago offset
def offsets_at_ts(topic: str, ts_ms: int) -> dict[int, int]:
    """For each partition, find the smallest offset whose ts >= ts_ms."""
    return {p: kafka_admin.offset_for_timestamp(topic, p, ts_ms)
            for p in kafka_admin.list_partitions(topic)}

bounded_offsets = offsets_at_ts("clickstream", SEVEN_DAYS_AGO_MS)
kafka_admin.commit_offsets("ctr_v2_grp", "clickstream", bounded_offsets)

# 2) Deploy v2 of the job; it picks up at the bounded offsets
deploy_flink_job(
    name="ctr_v2",
    consumer_group="ctr_v2_grp",
    output_table="ctr_5m_v2",
    bootstrap_from_committed=True,
)

# 3) Wait for v2 to reach the live tail
wait_until_caught_up("ctr_v2_grp", threshold_sec=30)

# 4) Atomic view swap; drain v1
swap_view(view="ctr_5m", points_at="ctr_5m_v2")
sleep(3600)  # cooldown
stop_flink_job("ctr_v1")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. offsets_at_ts calls Kafka's offsetsForTimes API to translate a wall-clock timestamp to a per-partition offset. Each partition returns the smallest offset whose message timestamp is >= the target.
  2. Committing those offsets to the new consumer group ctr_v2_grp means v2's first poll starts at "7 days ago" on every partition. No code change required in the Flink job.
  3. v2 replays 7 days of events. At 100k ev/s × 86400 s × 7 d = 60.5 billion events. Across 24 partitions at 5× live throughput per partition, replay finishes in 7 days / 5 = 1.4 days — about 33 hours.
  4. The atomic view swap is identical to the full-backfill pattern. The cooldown gives a rollback window.
  5. The key insight: bounded replay is not a different pipeline from full backfill — it is the same pipeline with a different starting offset. This is why Kappa's reprocessing model is genuinely simpler than Lambda's.

Output (reprocess summary).

metric value
events to replay 60.5 billion
partitions 24
events/partition 2.5 billion
replay throughput per partition 500k ev/s
wall-clock 2.5B / 500k = 5000s = 1.4 hours per partition (parallel)
total wall-clock ~33 hours (sequential within partition; partitions run in parallel)

Rule of thumb. Bounded replay is the cheap Kappa reprocessing pattern. Estimate wall-clock as events_per_partition / replay_throughput_per_partition. If that number is > 24 hours, consider bootstrap-with-batch instead — Iceberg scan is faster than Kafka replay.

Worked example — outbox pattern for atomic DB + Kafka writes

Detailed explanation. A microservice creates an Order row in its OLTP database and emits an OrderCreated event to Kafka. Naive "write DB, then publish to Kafka" leaves a window where DB succeeded but Kafka failed — or vice versa. The outbox pattern makes both writes atomic by deferring the Kafka publish.

Question. Write the outbox pattern as a single DB transaction plus a CDC tail. Show the schema and the consumer side.

Schema.

-- The business table
CREATE TABLE orders (
    order_id    BIGSERIAL PRIMARY KEY,
    user_id     BIGINT,
    amount      NUMERIC(18, 2),
    created_at  TIMESTAMPTZ DEFAULT now()
);

-- The outbox table — append-only, drains to Kafka
CREATE TABLE outbox (
    outbox_id     BIGSERIAL PRIMARY KEY,
    aggregate_id  BIGINT,        -- the order_id
    event_type    TEXT,
    payload       JSONB,
    created_at    TIMESTAMPTZ DEFAULT now(),
    published_at  TIMESTAMPTZ NULL  -- set by the CDC tail after Kafka ack
);
Enter fullscreen mode Exit fullscreen mode

Code (service-side transaction).

def create_order(user_id: int, amount: float) -> int:
    with db.transaction():
        row = db.insert("orders", user_id=user_id, amount=amount)
        db.insert("outbox",
                  aggregate_id=row.order_id,
                  event_type="OrderCreated",
                  payload={"order_id": row.order_id,
                           "user_id": user_id,
                           "amount": amount})
    return row.order_id  # transaction commit guarantees both writes succeed atomically

# CDC tail — separate process, often Debezium
def cdc_tail():
    for change in debezium.tail("public.outbox"):
        if change.op == "INSERT":
            kafka.produce("orders", key=change.aggregate_id, value=change.payload)
            db.update("outbox", outbox_id=change.outbox_id, published_at=now())
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The service writes both orders and outbox in a single DB transaction. Either both rows commit or neither does — atomicity is enforced by the DB, not the application.
  2. A separate CDC process (Debezium tailing the Postgres WAL, or Flink CDC) reads the outbox table's changes and produces them to Kafka. The producer is idempotent (de-duplication by outbox_id).
  3. After successful Kafka produce, the CDC process marks the outbox row as published. This bookkeeping lets us monitor lag (SELECT COUNT(*) FROM outbox WHERE published_at IS NULL).
  4. Why this matters for Kappa reprocessing. When v2 replays Kafka from the beginning, it sees exactly the events that actually happened in the DB — no missing events from a half-failed dual-write, no extra events from a retried publish. The log is genuinely the source of truth.
  5. The pattern composes with sagas / event sourcing: the outbox is the durability boundary for every distributed-system pattern that needs "DB write + downstream emit" atomicity.

Output (DB state after one order, before CDC drains).

table row published_at
orders order_id=1, user=42, amount=99.50 n/a
outbox outbox_id=1, aggregate_id=1, event_type=OrderCreated, payload={...} NULL

Output (after CDC drains).

table row published_at
outbox outbox_id=1, ... 2026-06-10 14:33:01

Rule of thumb. Every "DB write plus Kafka emit" pattern in a Kappa world should use the outbox. The cost is one extra table and one extra index; the benefit is genuine "log is the source of truth" semantics that hold under reprocessing.

Worked example — state restoration on Flink upgrade

Detailed explanation. A Flink job holds 50 GB of keyed state for a sliding-window aggregation. The team needs to upgrade the Flink minor version, which requires a stop-restart of the job. Naive restart loses the state and triggers a full replay; savepoint-and-restore preserves the state and resumes from the exact offset.

Question. Walk through the savepoint, upgrade, restore sequence for a stateful Flink job. Calculate the difference between savepoint-restore and full-replay wall-clock.

Code (operator).

# 1) Trigger a savepoint — Flink stops the job after writing state to S3
flink savepoint <job-id> s3://flink-savepoints/ctr/2026-06-10/

# 2) Upgrade Flink runtime
helm upgrade flink-cluster flink/flink --version 1.21.0

# 3) Resume the job from the savepoint
flink run -s s3://flink-savepoints/ctr/2026-06-10/ \
    ctr-job.jar --config job.conf

# 4) Confirm offset and state are restored
flink list  # should show the same job-id with the restored consumer group offsets
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. flink savepoint triggers a global checkpoint: every operator flushes its keyed state to S3, the consumer group's offsets are committed at the checkpoint boundary, and the job stops cleanly.
  2. The Flink runtime upgrade can now happen safely — there is no live job to disrupt. The savepoint is a self-contained snapshot.
  3. flink run -s <savepoint> starts the new job, loads the operator state from S3 into the new pods' RocksDB, and resumes consumption from the exact Kafka offsets recorded in the savepoint.
  4. The state is bit-for-bit identical to what it was at savepoint time. Sliding-window aggregations continue exactly where they left off — no missed events, no double-counted events.
  5. Wall-clock comparison. Savepoint+restore: state size 50 GB / restore throughput ~500 MB/s = 100 seconds. Full replay: 7 days of events at 100k ev/s replayed at 500k ev/s = 28 hours. Savepoint is ~1000× faster.

Output (wall-clock comparison).

approach wall-clock events re-processed data consistency
savepoint + restore ~2 minutes 0 (state restored exactly) exactly-once preserved
full replay from offset 0 ~28 hours 60+ billion exactly-once preserved (if EOS)

Rule of thumb. Always use savepoint+restore for operational reprocessing (upgrades, scaling, infra moves). Reserve full-replay for semantic reprocessing (bug fixes, schema changes, algorithm changes). The two patterns have wildly different costs and operational risk profiles.

Data engineering interview question on choosing the right reprocessing pattern

A senior interviewer often frames this as: "You discover a clickstream bug affecting the last 30 days. The Kafka retention is 7 days; the Iceberg archive has 5 years. Walk me through how you'd reprocess." It blends pattern selection, the retention boundary, and the bootstrap-and-tail hybrid.

Solution Using bootstrap-from-Iceberg + tail-from-Kafka

def reprocess_with_bootstrap_and_tail(
    iceberg_table: str,
    kafka_topic: str,
    fix_window_days: int,
    output_view: str,
) -> None:
    # 1) Define the reprocess window
    fix_start = now() - timedelta(days=fix_window_days)

    # 2) Bootstrap — Spark job over Iceberg time-travel to fix_start
    bootstrap_sql = f"""
        INSERT OVERWRITE {output_view}_v2
        SELECT ad_id,
               window_start,
               COUNT(*) FILTER (WHERE event_type='click') AS clicks,
               COUNT(*) FILTER (WHERE event_type='impression') AS impressions
        FROM {iceberg_table}
        WHERE event_ts >= TIMESTAMP '{fix_start}'
          AND event_ts <  CURRENT_TIMESTAMP - INTERVAL '7 days'  -- end where Kafka begins
        GROUP BY ad_id, window_start
    """
    spark.sql(bootstrap_sql)

    # 3) Tail — Flink stream from earliest Kafka offset (which is 7 days ago)
    deploy_flink_job(
        name="ctr_v2_tail",
        offset="earliest",  # earliest in Kafka == 7 days ago
        output_table=f"{output_view}_v2",
        write_mode="upsert",  # merges with bootstrap output
    )

    # 4) Wait for tail to catch up
    wait_until_caught_up("ctr_v2_tail_grp", threshold_sec=30)

    # 5) Atomic view swap
    swap_view(view=output_view, points_at=f"{output_view}_v2")
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

step engine window wall-clock
1 last 30 days defined < 1s
2 Spark on Iceberg 30 days back → 7 days back 12 minutes (Iceberg fast scan)
3 Flink on Kafka last 7 days → now 1.5 hours (replay at 5× live)
4 waiting for catch-up 1.5 hours (overlap with step 3)
5 atomic view swap < 1 second

The hybrid sidesteps Kafka's 7-day retention by reading the historical slice from Iceberg (which has 5 years) and the recent slice from Kafka (which is faster for live tail). Upsert semantics in the output mean the two writers cannot create duplicates on the boundary.

Output:

metric value
total wall-clock ~1.5 hours (bootstrap and tail overlap)
events processed via Iceberg 23 days × 100k ev/s × 86400 = ~199 B
events processed via Kafka 7 days × 100k ev/s × 86400 = ~60 B
data consistency on boundary exactly-once via upsert + watermark

Why this works — concept by concept:

  • Bootstrap with Spark on Iceberg — Iceberg's columnar Parquet + file pruning gives 10-50× the scan throughput of Kafka replay for the same data. The historical slice is cheaper to read from object storage than from Kafka, especially with partition / sort pruning.
  • Tail with Flink on Kafka — the live tail (last 7 days) is exactly the workload Kafka was built for: low-latency, append-only, partition-parallel. Flink reads from earliest (which is 7 days ago because retention is 7 days).
  • Boundary safety via upsert — both writers target the same output table. Spark writes the bootstrap rows with primary key (ad_id, window_start); Flink's upsert sink overwrites the same keys for the tail window. No double-counting on the boundary.
  • Watermark coordination — the bootstrap query ends at now() - 7 days; the tail starts at earliest (= now - 7 days). The single-second seam is covered by the upsert semantics — late events on the boundary land in whichever writer is still active.
  • One storage layer — Iceberg / Delta lets the same table be read by Spark and written by Flink with ACID guarantees. The bootstrap-and-tail pattern requires this — it does not work with separate batch and streaming storage.
  • Cost — bootstrap is O(history × Iceberg scan rate); tail is O(retention × Kafka replay rate). Total wall-clock is max(bootstrap, tail) because they run in parallel. Big-O: O(N) but with a 50× constant factor improvement over pure Kafka replay.

SQL
Topic — event modeling
Event modeling problems (SQL)

Practice →


5. 2026 reality — Lakehouse + Streaming SQL collapses both

Iceberg, Delta, and Streaming SQL make Lambda and Kappa both legacy for new builds

The mental model in one line: the modern Lakehouse stores both batch and streaming writes in the same ACID tables; Streaming SQL engines compile one query to either a batch plan or a streaming plan; the duplicate-code tax disappears. Once you can sketch the Lakehouse + Streaming SQL stack on a whiteboard, every Lambda-vs-Kappa interview probe collapses to "neither — here is the 2026 default."

Visual diagram of the 2026 Lakehouse + Streaming SQL architecture — Kafka feeding Flink CDC into Iceberg/Delta tables on S3 in the middle, with Materialize/RisingWave/Flink Streaming SQL on top reading the same tables and serving sub-second views; an annotation that the same SQL query compiles to batch or streaming plans; on a light PipeCode card.

The three layers of the Lakehouse-native stack.

  • Ingestion layer. Kafka as the event log; Flink CDC / Debezium as the database change feed; Flink or Spark Structured Streaming as the ingestion writer. The writer writes to Iceberg or Delta tables with exactly-once semantics.
  • Storage layer. Iceberg or Delta on S3 — ACID tables with time travel, schema evolution, partition evolution, and file-level metadata. The same table is readable by Spark batch jobs, Flink streaming jobs, Trino interactive queries, and Streaming SQL engines.
  • Query surface. One SQL dialect across all engines. Materialize, RisingWave, Flink SQL, and increasingly Snowflake / Databricks Streaming SQL compile the same query to either a batch plan (read once, return result) or a streaming plan (incremental view maintenance, sub-second updates).

The five inventions that made Lambda obsolete.

  • Iceberg and Delta — ACID tables on S3 finally gave the industry the "immutable master dataset with cheap point-in-time queries" that Lambda's 2011 design assumed.
  • Kafka transactions (KIP-98) — exactly-once semantics across the produce → process → produce → process chain; the speed layer is no longer "approximate."
  • Flink + Spark Structured Streaming maturity — exactly-once stream processors with savepoint+restore make streaming as durable as batch.
  • Materialize and RisingWave — true Streaming SQL with incremental view maintenance, written for analyst-readable SQL not Java-DAG topologies.
  • dbt + dlt + ELT-first stacks — the team-process change that lets a single SQL artifact live in version control and ship to either batch or streaming.

One query, two plans — the punch-line.

-- A single SQL definition
CREATE MATERIALIZED VIEW ctr_5m AS
SELECT
    ad_id,
    DATE_TRUNC('minute', event_ts) AS minute,
    SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) AS clicks,
    SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS imps,
    SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) * 1.0
        / NULLIF(SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END), 0) AS ctr
FROM events
WHERE event_ts >= now() - INTERVAL '5 minutes'
GROUP BY ad_id, minute;

-- On Materialize / RisingWave → compiles to incremental dataflow,
-- sub-second updates as new events arrive in the underlying table.

-- On Spark / Snowflake batch → compiles to a one-shot scan,
-- runs on schedule (every 5 minutes), refreshes the materialised view.

-- Same SQL. Same semantics. Two execution modes.
Enter fullscreen mode Exit fullscreen mode

Where the Lakehouse + Streaming SQL stack wins.

  • New builds. No legacy to migrate; pick the modern stack and skip the Lambda detour entirely.
  • Mixed batch + stream workloads. A team that needs both nightly cohort reports and sub-second dashboards can write both in one SQL dialect with one storage layer.
  • dbt-friendly teams. Analytics engineers can ship streaming pipelines using the same tooling as batch pipelines — dbt build works on a Materialize or RisingWave source the same way it works on Snowflake.
  • Schema evolution. Iceberg's schema evolution + time travel let you reshape tables without breaking downstream readers — neither Lambda nor pure Kappa offered this cleanly.

Where the Lakehouse + Streaming SQL stack still has rough edges in 2026.

  • Sub-millisecond serving. Materialize / RisingWave land at single-digit-millisecond latency; sub-millisecond serving (ad auction, HFT) still needs an in-memory KV store as the final hop.
  • Complex stateful joins. Stream-stream joins on multi-terabyte state sizes still favour Flink directly over Streaming SQL — though RisingWave is closing the gap fast.
  • Multi-region active-active. Iceberg is regionalised; cross-region Streaming SQL is still a roll-your-own exercise.
  • dbt-streaming integration. dbt's incremental materialisation model maps cleanly to streaming SQL but the tooling polish lags batch dbt by 12-18 months.

Migration patterns from Lambda to Lakehouse.

  • Strangle the speed layer first. Replace the Flink / Storm speed layer with a Materialize / RisingWave view over Iceberg. Iceberg is fed by both the old batch ingestion and a new Flink CDC ingestion. The merge query disappears because the view is the merge.
  • Then strangle the batch layer. Replace Spark batch jobs with Streaming SQL views over the same Iceberg tables. The batch schedule becomes the view refresh interval; the view definition is the metric.
  • Drop the serving layer. HBase / Cassandra serving can be replaced by direct queries against the materialised view in Materialize / RisingWave, or by a thin caching layer if sub-millisecond latency is required.
  • Retire the merge query. The COALESCE / UNION-ALL merge query disappears because there is only one view. No drift; no cutoff timestamp; no double-count risk.

Worked example — Iceberg as the unified master dataset

Detailed explanation. A team replaces the HDFS master dataset of their Lambda stack with Iceberg on S3. Both Spark batch jobs and Flink streaming jobs now read from and write to the same table. Time-travel queries replace "rerun the batch job and check the diff" debugging workflows.

Question. Write the Iceberg DDL for the events master dataset and show the batch and streaming reads using the same table reference.

Code (Iceberg DDL).

CREATE TABLE iceberg.warehouse.events (
    event_id    BIGINT,
    user_id     BIGINT,
    event_type  STRING,
    event_ts    TIMESTAMP,
    payload     STRING
)
PARTITIONED BY (days(event_ts), bucket(16, user_id))
TBLPROPERTIES (
    'format-version' = '2',
    'write.format.default' = 'parquet',
    'write.upsert.enabled' = 'true',
    'commit.retry.num-retries' = '5'
);
Enter fullscreen mode Exit fullscreen mode

Code (batch read — Spark).

# Spark — batch query, point-in-time time travel
df = spark.read.option("as-of-timestamp", "2026-06-01 00:00:00")\
                .table("iceberg.warehouse.events")
agg = df.groupBy("user_id").count()
agg.write.mode("overwrite").saveAsTable("iceberg.warehouse.events_per_user_d")
Enter fullscreen mode Exit fullscreen mode

Code (streaming read — Flink).

-- Flink SQL — streaming read, latest snapshot
CREATE TABLE events (...) WITH (
    'connector' = 'iceberg',
    'catalog-name' = 'warehouse',
    'table' = 'events',
    'streaming' = 'true',
    'start-snapshot-id' = '...'
);

CREATE TABLE events_per_user_5m AS
SELECT user_id,
       TUMBLE_START(event_ts, INTERVAL '5' MINUTE) AS bucket,
       COUNT(*) AS c
FROM events
GROUP BY user_id, TUMBLE(event_ts, INTERVAL '5' MINUTE);
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The Iceberg table is partitioned by day of event_ts (cheap day-level pruning) and bucketed by user_id (uniform parallelism for joins). The format-version = '2' plus write.upsert.enabled = 'true' enables row-level deletes — critical for GDPR and CDC ingestion.
  2. The Spark batch read uses Iceberg's time-travel feature: as-of-timestamp returns the snapshot of the table as of that exact moment. Useful for "rerun the report as it would have looked on June 1."
  3. The Flink streaming read tails the same table as a stream — every new Iceberg snapshot is emitted as a delta to the downstream operators. State is keyed by user_id; tumbling windows aggregate every 5 minutes.
  4. Both jobs see identical data because they read from the same physical files in the same Iceberg snapshot graph. No "batch view vs speed view" — just one table, two reading modes.
  5. The duplicate-code tax has not disappeared yet — there is still Spark code and Flink code. But the data is unified. The next step (Streaming SQL) collapses the code too.

Output (Spark batch result for one day).

user_id count
u1 14
u2 8

Output (Flink streaming result, sample 5-minute window).

user_id bucket c
u1 14:00 2
u2 14:00 1

Rule of thumb. Iceberg / Delta is the first migration step off Lambda. Even if you keep both Spark and Flink, unifying storage on Iceberg removes the dual-write problem and makes time-travel debugging trivial. Plan this migration before touching the compute layer.

Worked example — one SQL, two execution modes

Detailed explanation. A CTR metric is defined once as a Streaming SQL materialised view. Materialize compiles it to an incremental dataflow with sub-second update latency. The same SQL, run against the same Iceberg table in Snowflake, compiles to a batch query that runs on a 5-minute schedule. One artifact, two outputs.

Question. Show the single SQL definition and the explain plans for the streaming (Materialize) and batch (Snowflake) execution modes.

Code (single SQL).

-- Single definition; lives in version control as ctr_5m.sql
CREATE MATERIALIZED VIEW ctr_5m AS
SELECT
    ad_id,
    DATE_TRUNC('minute', event_ts) AS minute,
    SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) AS clicks,
    SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END) AS imps,
    SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) * 1.0
        / NULLIF(SUM(CASE WHEN event_type = 'impression' THEN 1 ELSE 0 END), 0)
        AS ctr
FROM iceberg.warehouse.events
WHERE event_ts >= now() - INTERVAL '5 minutes'
GROUP BY ad_id, minute;
Enter fullscreen mode Exit fullscreen mode

Code (Materialize execution — incremental dataflow).

EXPLAIN OPTIMIZED PLAN FOR MATERIALIZED VIEW ctr_5m;
-- Result: a Differential Dataflow plan
--   Reduce (group-by ad_id, minute)
--     Map (event_ts → minute)
--       Filter (event_ts >= now - 5m)
--         Source (iceberg.events, streaming snapshot tail)
-- Updates incrementally as new Iceberg snapshots arrive.
-- Steady-state latency: < 100ms from event to view update.
Enter fullscreen mode Exit fullscreen mode

Code (Snowflake execution — batch).

EXPLAIN PLAN FOR CREATE MATERIALIZED VIEW ctr_5m AS ...;
-- Result: a typical Snowflake batch plan
--   Aggregate (GROUP BY ad_id, minute)
--     Filter (event_ts >= now - 5m)
--       TableScan (iceberg.events, latest snapshot, partition pruning)
-- Refreshed on a schedule (every 5 minutes).
-- Steady-state latency: 5 minutes (refresh interval).
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The SQL is literally identical. The same .sql file in dbt or sqlmesh can be deployed to either engine. The semantics — group-by, filter, NULLIF, COALESCE — are ANSI-standard and behave the same way on every Streaming SQL engine.
  2. Materialize compiles the query to a Differential Dataflow plan that incrementally updates as new rows arrive in the underlying Iceberg table. State is held per-key in RocksDB; updates are O(arrival_rate).
  3. Snowflake compiles the same query to a conventional batch plan that runs on a refresh schedule. State is materialised to a real table; refresh cost is O(scan rate).
  4. The choice between the two modes is operational, not semantic. If you need sub-second freshness for a live dashboard, use Materialize. If 5-minute freshness is fine and your team already has Snowflake, use Snowflake.
  5. There is no longer a "batch logic" and a "streaming logic" written by two different teams. There is one SQL, and the execution mode is a deployment-time choice.

Output (latency comparison).

engine mode end-to-end latency infra cost
Materialize / RisingWave incremental dataflow < 100ms dedicated cluster
Flink SQL streaming plan 1-5 sec dedicated cluster
Snowflake / Databricks batch scheduled scan 5 min (= refresh interval) shared warehouse

Rule of thumb. Define the metric once in SQL, choose the execution mode at deployment time, and use Iceberg / Delta as the storage substrate. This is the architecture that makes the Lambda-vs-Kappa interview question obsolete — the answer is "neither, this."

Worked example — strangler migration from Lambda to Lakehouse

Detailed explanation. A team owns a 6-year-old Lambda pipeline with Spark batch, Flink speed, and Cassandra serving. They want to migrate to Iceberg + RisingWave without a flag day. The strangler pattern routes a slice of read traffic to the new stack, validates parity, and increases the slice until the old stack carries no traffic.

Question. Sketch a 4-phase strangler migration plan with parity checks at each phase.

Code (phase plan as data).

MIGRATION_PHASES = [
    {
        "phase": 1,
        "action": "Add Iceberg ingestion alongside HDFS",
        "writes_to": ["hdfs_master", "iceberg_events"],
        "reads_from": "cassandra (Lambda merged view)",
        "parity_check": "row counts match between hdfs and iceberg daily",
        "duration_weeks": 4,
    },
    {
        "phase": 2,
        "action": "Run RisingWave view over Iceberg; shadow read 1%",
        "writes_to": ["hdfs_master", "iceberg_events"],
        "reads_from": "99% cassandra + 1% risingwave",
        "parity_check": "RisingWave numbers match Cassandra within 0.1% for 7 days",
        "duration_weeks": 6,
    },
    {
        "phase": 3,
        "action": "Cut over read traffic to RisingWave",
        "writes_to": ["hdfs_master", "iceberg_events"],
        "reads_from": "100% risingwave; cassandra is read-only fallback",
        "parity_check": "no rollback for 30 days",
        "duration_weeks": 4,
    },
    {
        "phase": 4,
        "action": "Retire HDFS, Spark batch, Flink speed, Cassandra",
        "writes_to": ["iceberg_events"],
        "reads_from": "risingwave",
        "parity_check": "decom retro confirms zero downstream dependencies",
        "duration_weeks": 8,
    },
]

def total_migration_weeks() -> int:
    return sum(p["duration_weeks"] for p in MIGRATION_PHASES)

print(f"Total migration: {total_migration_weeks()} weeks = {total_migration_weeks()/4:.1f} months")
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Phase 1 (4 weeks) — dual-write to both HDFS and Iceberg. The old Lambda stack still owns reads. A daily parity check confirms the new ingestion is lossless. Cheap and reversible.
  2. Phase 2 (6 weeks) — stand up RisingWave over Iceberg with the metric defined as a materialised view. Shadow 1% of read traffic to RisingWave; compare every response against Cassandra's response. Drift > 0.1% blocks promotion to phase 3.
  3. Phase 3 (4 weeks) — flip read traffic to 100% RisingWave. Cassandra stays as a read-only fallback for 30 days. Any production incident in the new stack is one config change away from rollback.
  4. Phase 4 (8 weeks) — confirm no downstream readers depend on Cassandra / Spark batch / Flink speed; retire each in sequence. The HDFS master dataset is the last to go, after a decom retro.
  5. Total: 22 weeks (~5.5 months) for a 6-year-old Lambda stack. The strangler pattern trades calendar time for near-zero risk — every phase is reversible, and the old stack is alive until the new stack has proven itself.

Output (phase summary).

phase action duration reversibility
1 dual-write to HDFS + Iceberg 4 weeks trivially reversible
2 shadow read 1% from RisingWave 6 weeks trivially reversible
3 cut over to 100% RisingWave 4 weeks one config change to roll back
4 retire HDFS / Spark / Flink / Cassandra 8 weeks irreversible after decom

Rule of thumb. Never do a flag-day migration off a Lambda pipeline. The strangler pattern with shadow reads + parity checks is the only safe path. Budget 5-6 months for a 5+-year-old pipeline; budget 2-3 months for a 1-2-year-old one.

Data engineering interview question on the 2026 architecture choice

A senior interviewer often closes with: "For a brand-new clickstream pipeline at a Series B startup — 100k events/sec, sub-second dashboards required, single 4-person data team — would you build Lambda, Kappa, or Lakehouse + Streaming SQL? Defend the choice with three concrete reasons." It probes whether the candidate can apply the framework, not just describe it.

Solution Using a constraint-by-constraint matching of the workload to Lakehouse + Streaming SQL

def architecture_recommendation(profile: dict) -> dict:
    score = {"lambda": 0, "kappa": 0, "lakehouse_streaming_sql": 0}
    reasons = []

    if profile["latency_sla_seconds"] < 1:
        score["kappa"] += 1
        score["lakehouse_streaming_sql"] += 2
        reasons.append("sub-second SLA → streaming SQL on Lakehouse is fastest dev")

    if profile["team_size"] <= 4:
        score["lakehouse_streaming_sql"] += 2
        score["kappa"] += 1
        reasons.append("small team → avoid duplicate-code tax (lambda penalised)")

    if profile["build_age_years"] == 0:
        score["lakehouse_streaming_sql"] += 2
        reasons.append("new build → no legacy to migrate; skip Lambda entirely")

    if profile["mixed_batch_and_stream"]:
        score["lakehouse_streaming_sql"] += 2
        score["lambda"] += 1
        reasons.append("mixed workload → one SQL surface beats two engines")

    if profile["regulated_batch_required"]:
        score["lambda"] += 2
        reasons.append("regulated batch → Lambda's separate batch layer is still useful")

    if profile["kafka_retention_days"] >= 365:
        score["kappa"] += 1
        score["lakehouse_streaming_sql"] += 1
        reasons.append("long retention → replay-based reprocessing is viable")

    winner = max(score, key=score.get)
    return {"verdict": winner, "score": score, "reasons": reasons}

profile = {
    "latency_sla_seconds": 1,
    "team_size": 4,
    "build_age_years": 0,
    "mixed_batch_and_stream": True,
    "regulated_batch_required": False,
    "kafka_retention_days": 30,
}
print(architecture_recommendation(profile))
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

constraint lambda + kappa + lakehouse +
sub-1s SLA 0 1 2
4-person team 0 1 2
new build 0 0 2
mixed batch + stream 1 0 2
regulated batch 0 0 0
30-day retention 0 0 0
total 1 2 8
verdict lakehouse_streaming_sql

The matrix mechanises what good engineering judgment already says: a new, small-team build with mixed batch + stream needs and sub-second SLA is the textbook Lakehouse + Streaming SQL workload.

Output:

verdict score reasons
lakehouse_streaming_sql 8 sub-second SLA, small team, new build, mixed workload
kappa 2 sub-second SLA, small team
lambda 1 mixed workload (only)

Why this works — concept by concept:

  • Constraint-driven scoring — turns the "Lambda vs Kappa" debate into a data-driven matching exercise. Each constraint contributes votes to whichever architecture handles it best; the winner is the cumulative best fit.
  • New-build bias toward modern stacksbuild_age_years == 0 gives Lakehouse a free 2 points because there is no legacy migration cost. New builds are exactly the workload where the modern stack pays off.
  • Team-size penalty for Lambda — a 4-person team cannot sustainably own Spark + Flink + Cassandra + the merge query. The duplicate-code tax compounds with every metric; Lakehouse's single-SQL-surface flattens the cost curve.
  • Latency SLA discriminator — sub-second is doable on either Kappa (Flink) or Lakehouse (Materialize / RisingWave) but not on Lambda (5-15-minute speed layer freshness is the practical floor).
  • Regulated-batch escape hatch — Lambda survives where a regulator literally requires "the authoritative numbers must come from a separate batch job." For everyone else, the escape hatch is unused.
  • Kafka-retention link to Kappa — Kappa's reprocessing model assumes you can replay; 30-day retention gives 1 point but not 2. Long retention (365+ days) pushes Kappa back into contention for reprocessing-heavy workloads.
  • Cost — the recommendation function is O(constraints) — instant. The migration it implies is months; the recommendation just frames the choice. Big-O: O(N) over the constraint list.

SQL
Topic — event modeling
Event modeling problems (SQL)

Practice →


Cheat sheet — Lambda vs Kappa vs Lakehouse decision rules

  • When Lambda still wins. Regulated batch requirements (finance, healthcare) that demand a separate, auditable batch run. Legacy pipelines that have run nightly for 5+ years and survived multiple platform migrations — wrap them, do not rewrite them.
  • When Kappa still wins. Append-only event streams with sub-second SLAs, long Kafka retention (180+ days), single-team ownership of one streaming engine, and no heavy joins to slowly-changing dimensions.
  • When Lakehouse + Streaming SQL wins. New builds. Mixed batch + stream workloads in the same team. Analytics-engineering-first cultures where the metric definition is an SQL artifact in dbt or sqlmesh, not Java in a topology repo.
  • Reprocessing cost calculus. Kafka replay throughput is typically 3-10× live throughput per partition. If events_to_replay / partitions / replay_throughput is < 24 hours, replay-from-offset works. Otherwise, bootstrap-from-Iceberg + tail-from-Kafka is the only economic choice.
  • Serving layer choice. KV store (HBase / Cassandra / DynamoDB) for point lookups < 10ms. OLAP store (Druid / Pinot / ClickHouse) for slice-and-dice aggregates. Streaming SQL materialised view (Materialize / RisingWave) for live dashboards. The choice is not coupled to Lambda vs Kappa — pick the right serving substrate independently.
  • Code duplication smell test. If the same metric is implemented in Spark and Flink, you have the duplicate-code tax. Schedule a quarterly audit; budget migration once the count of dual-implemented metrics exceeds the team's tolerance for drift bugs.
  • Storage cost calculus. Kafka tiered storage is ~$0.023/GB/month on S3 in 2026; Iceberg on cold S3 is the same. Hot SSD / broker storage on Kafka is 10-50× more expensive. Push retention beyond 7-30 days into Iceberg early — that is also when bootstrap-from-Iceberg pays off.
  • Exactly-once budget. Flink + Kafka transactions give end-to-end EOS for ~20% overhead vs at-least-once. Cheaper than reconciliation jobs; budget it from day one.
  • Outbox pattern non-negotiable. Every Kappa or Lakehouse build that ingests from an OLTP source needs the outbox + CDC tail. Without it, the log is not the source of truth and reprocessing is unsafe.
  • dbt-streaming readiness. Test whether your Streaming SQL engine ships a dbt adapter; if yes, the metric lives in a .sql file in version control with tests. If no, the engine is probably not production-ready.
  • Strangler migration. Never flag-day off Lambda. Dual-write to Iceberg → shadow read from RisingWave → flip reads → retire old stack. 4 phases, 4-8 weeks each, 5-6 months total for a mature pipeline.
  • Lambda + Kappa hybrid (lambda-kappa). A surprisingly common 2018-2022 pattern: Kappa for the speed layer, batch jobs for the long-history backfills, Iceberg as the shared store. Calling it "hybrid" understates how often it is the actual destination of a Lambda migration.

Frequently asked questions

What is Lambda architecture in plain English?

Lambda architecture is a 2011 data-platform pattern invented by Nathan Marz at Twitter / BackType. It says "compute every metric twice — once in a slow, authoritative batch job over the immutable master dataset on HDFS / S3, and once in a fast, approximate streaming job over the live event source — then merge the two results in a serving database at query time." The batch layer guarantees correctness; the speed layer guarantees freshness; the serving layer guarantees the client sees one merged answer. The cost is the duplicate-code tax: every metric is implemented twice in two engines, which drifts over time and creates an integration-testing matrix that scales with the number of metrics. Lambda was the right answer in 2011 when Kafka was immature and streaming engines could not offer exactly-once; in 2026 most of its 2011 constraints have been retired by Iceberg, Delta, and exactly-once Flink.

What is Kappa architecture and how does it differ from Lambda?

Kappa architecture is Jay Kreps's 2014 proposal that Lambda's batch layer is unnecessary if Kafka is treated as the immutable log and a single exactly-once stream processor (Flink, Kafka Streams, Samza) computes every view. The differences are: one engine instead of two, one codebase instead of two, no merge query, and reprocessing-by-replay (point the new code at an older Kafka offset and recompute) instead of reprocessing-by-batch-overwrite. Kappa wins on append-mostly event streams with sub-second SLAs; it hurts on heavy historical backfills, complex joins to slowly-changing dimensions, and regulated workloads that require a separate authoritative batch run. The 2026 critique of Kappa is that the Lakehouse pattern (Iceberg / Delta + Streaming SQL) collapses the same dual-engine problem into a single SQL surface with even less operational overhead.

Is Lambda architecture still relevant in 2026?

Mostly only for two scenarios. First, regulated batch jobs (finance, healthcare, pharma) where a regulator literally requires the authoritative numbers come from a separate, auditable batch run with a documented input-output contract. Second, legacy pipelines that have run nightly for 5+ years, survived multiple platform migrations, and have well-known performance / cost characteristics — rewriting them for a Lakehouse pays off only if there is a concrete drift bug or compliance pressure. For new builds, Lambda is dominated by Lakehouse + Streaming SQL on almost every axis: same SQL for batch and stream, one storage layer, no merge query, no duplicate-code tax. Even teams that still operate Lambda are usually mid-migration to Iceberg / Delta as the shared store.

What is the duplicate-code tax?

The duplicate-code tax is the recurring engineering cost of implementing the same business logic in two engines — Spark for batch, Flink (or Storm or Samza) for speed. The tax has four components: code duplication (every metric is written twice), infra duplication (two clusters, two on-call rotations, two deployment pipelines), semantic drift (subtle differences in null handling, rounding, watermarking, and join semantics produce diverging numbers), and integration testing across engines (end-to-end tests must seed both layers, run the merge query, and assert the combined output matches an oracle). The tax is the single biggest reason teams migrate off Lambda; it scales linearly with the number of metrics and compounds operationally with every team turnover.

How does the Lakehouse + Streaming SQL stack replace Lambda and Kappa?

The Lakehouse + Streaming SQL stack has three layers: Kafka + Flink CDC for ingestion, Iceberg or Delta on S3 for storage, and Materialize / RisingWave / Flink SQL for the query surface. The same SQL definition compiles to either an incremental dataflow (streaming) or a scheduled scan (batch) depending on the engine. The duplicate-code tax disappears because there is only one codebase (the SQL artifact in dbt / sqlmesh / version control). The merge query disappears because there is only one view. Reprocessing is either a Streaming SQL view refresh (incremental engines handle this automatically) or an Iceberg time-travel batch query. The result is the architecture that makes the Lambda-vs-Kappa interview question obsolete — in 2026, "neither, use a Lakehouse" is the staff-engineer answer on most new builds.

When should I pick Kappa over Lambda for a new build?

Pick Kappa only when all five conditions hold: append-only event source (no OLTP mutations), sub-second freshness SLA, single small team that owns one stream engine, no heavy historical backfills planned, and Kafka retention budget for 180+ days (or an Iceberg archive for older history). If any condition fails, prefer Lakehouse + Streaming SQL — it dominates Kappa on the same workload because the storage layer (Iceberg) gives time-travel and cheap historical scans for free, and the query layer (Materialize / RisingWave / Flink SQL) gives the same incremental dataflow with a more analyst-friendly SQL surface. Kappa survives as a stylistic choice for teams that prefer raw Flink over Streaming SQL; the architecture itself is increasingly a special case of the Lakehouse pattern.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every reprocessing pattern above ships with hands-on practice rooms where you write the side-deploy orchestration, the outbox pattern, the bootstrap-with-batch hybrid, and the Streaming SQL materialised view against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so when an interviewer asks "Lambda or Kappa?" you can answer with the architecture that actually wins for the workload — and back it with code you have written before.

Practice streaming now →
Event processing drills →

Top comments (0)