Gowtham Potureddi

Posted on Jun 12

Apache Kafka Streams vs Apache Flink: Stateful Streaming Engines Compared

#python #sql #interview #dataengineering

kafka streams vs flink is the single biggest architecture call a streaming team makes in 2026 — and the one most teams get wrong on the first try because they treat both engines as "streaming frameworks" instead of as fundamentally different deployment shapes. Kafka Streams is a JVM library you embed inside your microservice. Flink is a cluster runtime with a JobManager, a fleet of TaskManagers, and its own scheduler. The DSL on top looks similar; the operational reality could not be more different.

This guide is the senior-DE comparison you wished existed the first time an interviewer asked "when do you pick apache flink vs kafka for stateful processing?" or "how does flink vs spark change once you need exactly-once joins on 100K events per second?" It walks through the Kafka Streams library model (KStream / KTable / GlobalKTable, per-partition stream tasks, embedded RocksDB stores backed by changelog topics), the Flink JobGraph + operator-chain runtime (StreamGraph → JobGraph → ExecutionGraph, TaskManagers, slot sharing, savepoints), state backends and exactly-once on both sides (flink checkpointing via Chandy-Lamport barriers vs kafka streams changelog replay, two-phase-commit sinks for end-to-end EOS), and the 5-question decision tree senior engineers use to pick the right engine for a new pipeline. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

When you want hands-on reps immediately after reading, drill the streaming practice library →, rehearse on streaming medium-difficulty problems →, and sharpen the Python axis with the Python streaming drills →.

On this page

Why a streaming engine choice matters in 2026
Kafka Streams library model
Flink JobGraph + operator chain
State backends + exactly-once
Picking the engine — co-located vs cluster
Cheat sheet — streaming engine recipes
Frequently asked questions
Practice on PipeCode

1. Why a streaming engine choice matters in 2026

Kafka Streams is a library, Flink is a cluster — the deployment shape, not the DSL, is what makes the engineering trade-off

The one-sentence invariant: Kafka Streams runs as a JVM library inside your process; Flink runs as a separate cluster of JobManager + TaskManagers that your job is deployed to. Every other comparison — DSL ergonomics, state model, exactly-once semantics, scale envelope, hiring pool — follows from that one structural difference. Once you internalise "library vs cluster," the entire kafka streams vs flink interview surface collapses to a sequence of consequences from that one choice.

Three axes interviewers actually probe.

State. Both engines keep keyed state in RocksDB. Kafka Streams replicates it continuously to a changelog topic in Kafka. Flink snapshots it periodically to a remote DFS (S3 / HDFS / GCS) via Chandy-Lamport barriers. The recovery story is different (replay-log vs restore-snapshot), and that difference dominates failover latency.
Exactly-once. Kafka Streams ships processing.guarantee=exactly_once_v2 which leans on Kafka's transactional producer. Flink ships TwoPhaseCommitSinkFunction which generalises the pattern to any sink that can pre-commit and commit. End-to-end EOS on both engines requires a transactional sink; getting that wrong is the most common production bug interviewers love to probe.
Scale. Kafka Streams parallelism is bounded by your input partition count — N partitions → N stream tasks, period. Flink parallelism is per-operator — a Source can run at parallelism 4, a KeyBy at 16, a Sink at 8, all in the same job. If your pipeline needs to scale operator-by-operator (the source is fast, the windowing is slow), Flink wins. If you can afford to keep partitions ≈ parallelism, Kafka Streams is dramatically simpler.

The DSL split — same name, different mental model.

Kafka Streams ships a Java/Scala DSL with KStream, KTable, GlobalKTable. The unit of work is a stream task pinned to an input partition. You think in terms of "this partition has its own RocksDB store; co-partitioning makes joins work."
Flink ships a Java/Scala/Python DataStream API and a SQL layer (Flink SQL) on top. The unit of work is an operator subtask that runs inside a TaskManager slot. You think in terms of "this operator runs at parallelism N; the keyed state is partitioned by hash; a barrier cuts a global snapshot every 60s."

The 2026 reality — what changed since 2022.

Kafka Streams 3.x added exactly_once_v2 (single transactional producer per stream thread instead of per task), dramatically reducing the producer-fencing overhead that made EOS-v1 expensive at scale.
Flink 1.18 / 2.0 stabilised the disaggregated state backend (ForStDB) and made async snapshotting the default — checkpoints no longer block the operator's hot path.
Spark Structured Streaming got real continuous-mode (sub-100ms) but still ships micro-batch as the default; the flink vs spark interview answer in 2026 is "Flink for sub-second latency, Spark when you already have a Spark batch team and 100ms-1s is fine."
Apache Pulsar and Redpanda are now common Kafka-protocol replacements; Kafka Streams works on Redpanda out of the box, Flink works on Pulsar via the official connector.

What interviewers listen for.

Do you say "Kafka Streams is a library, Flink is a cluster" in the first sentence? — senior signal.
Do you describe state as "RocksDB local + changelog (Kafka Streams) or RocksDB local + distributed snapshot (Flink)"? — required answer.
Do you mention "end-to-end exactly-once needs a transactional sink in both" unprompted? — senior signal.
Do you push back on "Flink replaces Kafka Streams" with "they target different deployment shapes"? — senior signal.

Worked example — same word-count, two engines

Detailed explanation. The classic streaming Hello World — count words per minute from a topic of strings — looks deceptively similar in both engines. The DSL reads almost identically. The runtime is wildly different: Kafka Streams launches inside your JAR with no scheduler; Flink compiles the topology into a JobGraph and submits it to a cluster. Both store the per-key count in RocksDB. Only Flink lets you scale the windowing operator independently of the source parallelism.

Question. Write a 1-minute tumbling word-count over a words Kafka topic in both Kafka Streams (Java DSL) and Flink (DataStream API). Highlight the runtime difference (where does the job actually run?), the parallelism model, and the state location.

Input.

event_time	word
12:00:05	kafka
12:00:10	flink
12:00:30	kafka
12:01:05	flink
12:01:40	flink

Code.

// Kafka Streams — runs inside YOUR JVM as a library
StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream("words")
       .groupBy((k, v) -> v)
       .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
       .count(Materialized.as("word-count-store"))
       .toStream()
       .to("word-counts", Produced.with(WindowedSerdes.timeWindowedSerdeFrom(String.class), Serdes.Long()));

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();   // your microservice now hosts the topology

// Flink DataStream API — submitted to a Flink cluster as a JobGraph
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000);

DataStream<String> words = env.fromSource(
    KafkaSource.<String>builder().setTopics("words").build(),
    WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)),
    "kafka-source");

words.keyBy(w -> w)
     .window(TumblingEventTimeWindows.of(Time.minutes(1)))
     .aggregate(new CountAggregator())
     .sinkTo(KafkaSink.<Tuple2<String, Long>>builder()
                       .setTransactionalIdPrefix("wc-")
                       .build());

env.execute("word-count");   // submits to JobManager — cluster runs it

Step-by-step explanation.

The DSL looks similar — both call keyBy / groupBy then window then aggregate. The difference is the call site: streams.start() in Kafka Streams launches inside the local JVM; env.execute() in Flink hands the JobGraph to a remote JobManager.
Kafka Streams creates one stream task per input partition — if words has 6 partitions, exactly 6 tasks run, distributed across whatever instances of your microservice are alive. State for each task lives in RocksDB on the same machine as the task.
Flink compiles keyBy → window → aggregate into an operator chain, then deploys it at the configured parallelism (independent of partition count). The keyed state is partitioned by the key's hash mod parallelism; RocksDB instances live in TaskManager slots, snapshotted to S3 on every checkpoint.
Both engines emit the windowed counts at window-close (or earlier with EMIT EARLY config in Kafka Streams suppress operator / Flink incremental aggregation). Output goes back to Kafka in both cases.
The key operational difference: Kafka Streams scales by adding microservice replicas; Flink scales by adding TaskManagers and reconfiguring the JobGraph parallelism.

Output.

window_start	window_end	word	count
12:00:00	12:01:00	kafka	2
12:00:00	12:01:00	flink	1
12:01:00	12:02:00	flink	2

Rule of thumb. If the team already runs Spring Boot / Quarkus microservices, Kafka Streams piggybacks on that infra and there is no new cluster to operate. If the team needs operator-parallel scaling, multi-source connectors, or sub-second latency on a 1M events/sec firehose, the Flink cluster pays for itself.

Worked example — partition parallelism vs operator parallelism

Detailed explanation. A common interview question — "you have a Kafka topic with 4 partitions but the per-key compute is CPU-bound at 10x the source rate. How does each engine scale?" The answer is the cleanest demonstration of the library-vs-cluster split.

Question. Given a Kafka topic events with 4 partitions and a CPU-bound aggregation step, show how Kafka Streams and Flink scale the aggregation independently of the source.

Input.

topic	partitions	source rate	per-key compute rate
events	4	40K msg/sec	4K msg/sec

Code.

// Kafka Streams — parallelism CAPPED at partition count (4)
// To scale, you MUST increase the partition count on the source topic.
builder.stream("events")
       .groupByKey()
       .aggregate(...)        // runs at partition parallelism = 4
       .toStream()
       .to("agg-output");

// Workaround: repartition first
builder.stream("events")
       .selectKey((k, v) -> newKey(v))
       .repartition(Repartitioned.numberOfPartitions(16))  // 16 internal partitions
       .groupByKey()
       .aggregate(...)        // now runs at parallelism 16
       .to("agg-output");

// Flink — operator parallelism is INDEPENDENT of source partitions
env.fromSource(kafkaSource, watermark, "src")
   .setParallelism(4)                                  // source = 4 (one per partition)
   .keyBy(e -> e.userId)
   .process(new HeavyAggregator())
   .setParallelism(16)                                 // aggregator = 16
   .sinkTo(kafkaSink)
   .setParallelism(8);                                 // sink = 8

Step-by-step explanation.

In Kafka Streams, the aggregation runs in the same stream task as the source — same partition, same JVM thread. Parallelism is fixed at partition count = 4. You cannot run 16 aggregator threads against 4 partitions; one task owns one partition end-to-end.
The workaround is to add a repartition step which writes to an internal Kafka topic with 16 partitions, then reads it back. That decouples source parallelism from downstream parallelism but adds a round-trip to Kafka (latency + storage cost).
In Flink, every operator has its own parallelism setting. The source runs at 4 (one subtask per partition), the heavy aggregator runs at 16, the sink runs at 8. Flink inserts a shuffle automatically — keyed network exchange — between operators with different parallelism.
The Flink scheduler maps the 16 aggregator subtasks across however many TaskManagers are available, using slot sharing to pack multiple operators into the same slot when possible.
End result: Flink can run the heavy step at 16 threads while Kafka Streams either runs at 4 or pays for an internal repartition topic.

Output (architecture comparison).

Aspect	Kafka Streams	Flink
Source parallelism	= partition count (4)	configurable (4)
Aggregator parallelism	= upstream parallelism unless you repartition	configurable (16)
Repartition cost	extra Kafka topic + round-trip	in-cluster shuffle, no Kafka
Scaling unit	partition count + microservice replicas	TaskManager slots

Rule of thumb. If your per-operator compute cost varies by 4x or more, Flink's operator parallelism pays for the cluster cost. If your topology is mostly source-paced, Kafka Streams stays simpler.

Worked example — DSL ergonomics on a join

Detailed explanation. Stream-table joins are the bread and butter of streaming pipelines: enrich an event stream with a slowly-changing dimension table. Both engines support it; the syntax and the semantics differ in important ways.

Question. Write a stream-table join that enriches every orders event with the latest customers profile. Show the Kafka Streams KStream-KTable join and the Flink keyed connect + state equivalent.

Input — orders (KStream / DataStream).

order_id	customer_id	amount
1	c1	50
2	c2	30
3	c1	75

Input — customers (KTable / KeyedState).

customer_id	tier
c1	gold
c2	silver

Code.

// Kafka Streams — natural KStream-KTable join
KTable<String, Customer> customers = builder.table("customers");
KStream<String, Order> orders = builder.stream("orders");

orders.join(customers,
            (order, customer) -> enrich(order, customer))
      .to("enriched-orders");

// Flink — connect two streams, keep customer in keyed broadcast state
DataStream<Order> orders = env.fromSource(orderSource, ws, "orders");
BroadcastStream<Customer> customers = env.fromSource(customerSource, ws, "customers")
                                          .broadcast(customerDescriptor);

orders.keyBy(o -> o.customerId)
      .connect(customers)
      .process(new KeyedBroadcastProcessFunction<>() {
          public void processElement(Order o, ReadOnlyContext ctx, Collector<Enriched> out) {
              Customer c = ctx.getBroadcastState(customerDescriptor).get(o.customerId);
              out.collect(enrich(o, c));
          }
          public void processBroadcastElement(Customer c, Context ctx, Collector<Enriched> out) {
              ctx.getBroadcastState(customerDescriptor).put(c.customerId, c);
          }
      })
      .sinkTo(enrichedSink);

Step-by-step explanation.

Kafka Streams treats the customers topic as a KTable — a changelog representation where each new record upserts the value for its key. The runtime materialises the table into RocksDB on the same instance.
The join operator co-partitions the streams: the orders topic and the customers topic must be co-partitioned (same key, same partition count, same partitioner). The DSL emits an internal repartition topic if co-partitioning fails.
Flink has no first-class KStream-KTable join; you build it with connect + KeyedBroadcastProcessFunction. The broadcast stream replicates every customer record to every subtask, and each subtask keeps its own copy in broadcast state.
For small lookup tables, broadcast state is fine. For large lookup tables, use Flink's LookupJoin against an external store (JDBC, HBase, async I/O) — that is the Flink idiom that has no Kafka Streams equivalent.
The output is the same: enriched orders keyed by customer_id. The operational shape differs: Kafka Streams co-locates the lookup state; Flink lets you choose between in-memory broadcast and external lookup.

Output.

order_id	customer_id	amount	tier
1	c1	50	gold
2	c2	30	silver
3	c1	75	gold

Rule of thumb. If your lookup is a Kafka topic with bounded cardinality, Kafka Streams KTable is the lower-friction answer. If your lookup is in a database or has > 10M keys, Flink's async lookup join is the only realistic option.

Senior interview question on engine choice

A senior interviewer often opens with: "Walk me through how you would choose between Kafka Streams and Flink for a new pipeline. What are the three or four questions you ask, in order, and what answer to each one pushes you to one engine over the other?"

Solution Using a 4-question decision framework

Decision framework — Kafka Streams vs Flink

1. Where does the data come from?
   - Kafka only           → Kafka Streams eligible
   - Kafka + Kinesis/JDBC/files → Flink (multi-source connectors)

2. What is the parallelism shape?
   - source rate ≈ compute rate, partition count is enough → Kafka Streams
   - per-operator parallelism differs by 4x or more       → Flink

3. What is the exactly-once contract?
   - Kafka → Kafka, transactional             → Kafka Streams EOS-v2 is enough
   - Kafka → JDBC / S3 / Iceberg, transactional → Flink TwoPhaseCommitSinkFunction

4. What is the operational shape?
   - Already running Spring Boot microservices, no platform team → Kafka Streams
   - Have or want a streaming platform team running a cluster   → Flink

Step-by-step trace.

Pipeline	Q1 source	Q2 parallelism	Q3 EOS	Q4 ops	Picked
User clickstream → enrich → Kafka	Kafka	source ≈ compute	Kafka → Kafka	microservices	Kafka Streams
Multi-source CDC → Iceberg	Kafka + JDBC	varies	EOS to Iceberg	platform team	Flink
Real-time fraud (heavy ML)	Kafka	compute >> source	Kafka → Kafka	platform team	Flink
Per-user state machine, low scale	Kafka	source ≈ compute	Kafka → Kafka	microservices	Kafka Streams

After the 4-question pass, the engine choice is usually unambiguous. The remaining 5% — where both engines work — defaults to whatever the team already operates.

Output:

Engine	When it wins
Kafka Streams	Kafka-only, partition-bounded parallelism, microservice deployment, EOS-v2 to Kafka
Flink	Multi-source, per-operator parallelism, EOS to non-Kafka sinks, cluster-runtime team

Why this works — concept by concept:

Library vs cluster — the structural axis — every other consequence (state, parallelism, EOS, ops) follows from where the engine runs. Asking the deployment question first short-circuits a lot of false dichotomies.
Partition vs operator parallelism — Kafka Streams is partition-bounded by design (one task per partition). Flink is operator-parallel (each operator has its own parallelism). If those two coincide for your workload, Kafka Streams is simpler.
EOS surface area — Kafka Streams EOS-v2 is bound to Kafka producers. Flink's EOS generalises via two-phase commit to any sink that implements TwoPhaseCommitSinkFunction (JDBC, S3, Iceberg, Pulsar).
Ops cost is real — running a Flink cluster requires a platform team (JobManager HA, TaskManager autoscaling, savepoint storage, monitoring). Kafka Streams piggybacks on whatever runs your microservices.
Cost — both engines are O(events) per record; the cost lives in operations, not throughput. Cluster TCO often dominates the cloud bill once you cross ~5 jobs.

Streaming
Topic — streaming
Streaming engine selection problems

Practice →

Streaming Topic — streaming · medium Medium-difficulty streaming problems

Practice →

2. Kafka Streams library model

`kafka streams` is a JVM library, not a server — one stream task per input partition is the whole parallelism story

The mental model in one line: a Kafka Streams app is a JAR that you run as a microservice; the runtime creates exactly one stream task per input partition, each task owns its slice of RocksDB state, and durability is provided by a per-task changelog topic written back to Kafka. Once you say that out loud, every kafka streams ktable interview question becomes a deduction from "tasks are partition-pinned and state is local."

The three core abstractions.

KStream — an append-only record stream. Every event is independent; "rate of events" is the only thing that matters. Operations: filter, map, flatMap, groupBy, join, repartition.
KTable — a changelog-backed materialised view. Each (key, value) upserts the previous value for that key. Operations: mapValues, groupBy + aggregate, join. Internally backed by a RocksDB store + changelog topic.
GlobalKTable — a KTable replicated on every instance of the app. Used for star-schema joins against small lookup tables. No co-partitioning required (joins don't need matching partitions on the stream side).

The stream-table duality.

A table is a snapshot of the latest value per key.
A stream is the sequence of changes that produced the table.
They are the same data, viewed two ways. toStream() converts a KTable into the underlying changelog stream; groupByKey().aggregate() converts a KStream into a KTable.

Stream tasks — the unit of parallelism.

One task per input partition. A topic with 12 partitions → 12 stream tasks. Period.
Tasks are scheduled across StreamThreads within a single JVM, and across multiple instances of the app. You scale horizontally by adding instances; Kafka Streams' consumer-group protocol rebalances tasks across the alive instances.
A task owns its slice of RocksDB. No cross-task state access. Co-partitioning rules force every input topic the task reads to share the same partition.

The changelog topic — durability without a separate store.

Every write to a state store is also written to an internal changelog topic with name appId-storeName-changelog. Compacted by design (one record per key).
On task restart on a different machine, the new owner replays the changelog from offset 0 to rebuild RocksDB.
For large stores, this can take minutes — Kafka Streams ships standby replicas config to keep warm copies on other instances, slashing recovery time.

Co-partitioning rules (the rule book interviewers love).

Same number of partitions on both input topics for a KStream-KStream or KStream-KTable join.
Same partitioner (default is murmur2 hash; if one side was written with a custom partitioner, joins fail).
Same key serde — the join uses the key bytes for equality.
If any rule fails, Kafka Streams either errors at build time or transparently inserts a repartition step (extra internal topic + round-trip to Kafka).

Interactive Queries — reading state out of the running app.

Kafka Streams exposes KafkaStreams#store(...) to query the local RocksDB stores from the running app's JVM.
A small HTTP layer (you write it; Kafka Streams gives you StreamsMetadata to route queries) turns the app into a queryable cache.
This is the "no separate KV store" pattern many teams use to avoid running a separate Redis / DynamoDB for read-side state.

Common interview probes on the library model.

"What is the relationship between input partitions and stream tasks?" — strictly 1:1.
"What happens to state when a task moves to another instance?" — replay the changelog; warm with standby replicas.
"When does Kafka Streams insert an internal repartition topic?" — when co-partitioning fails or when you call selectKey / groupBy with a different key.
"What is a GlobalKTable for?" — small lookup tables that you join from any partition without co-partitioning.

Worked example — KStream vs KTable on a click counter

Detailed explanation. A clicks topic emits (user_id, page_id) records. You want two views: the raw stream of clicks (for downstream pipelines) and the running click count per user (for a dashboard). KStream is the stream view; KTable is the aggregated view.

Question. Build both views on the same clicks topic. Show the DSL, the underlying state store, and how the two outputs differ.

Input.

time	user_id	page_id
12:00	u1	/home
12:01	u2	/home
12:02	u1	/pricing
12:03	u1	/home

Code.

StreamsBuilder builder = new StreamsBuilder();

// KStream view — every click forwarded as-is
KStream<String, String> clicks = builder.stream("clicks");
clicks.to("clicks-out");

// KTable view — running click count per user
KTable<String, Long> clickCount = clicks
    .groupByKey()
    .count(Materialized.as("click-count-store"));

clickCount.toStream().to("click-count-out", Produced.with(Serdes.String(), Serdes.Long()));

Step-by-step explanation.

clicks.to("clicks-out") is a pure stream pass-through — every input record produces one output record.
groupByKey().count() builds a KTable materialised as the click-count-store RocksDB store. Each input record updates the store and emits an update record to the changelog and the output topic.
After u1 /home at 12:00, the store has u1 → 1 and emits (u1, 1) to the count topic.
After u1 /pricing at 12:02, the store updates to u1 → 2 and emits (u1, 2). The previous (u1, 1) record is not deleted — it is superseded by the upsert in a compacted topic.
After u1 /home at 12:03, store updates to u1 → 3. Final state per user is the latest value per key in the changelog.

Output (KStream view — clicks-out).

user_id	page_id
u1	/home
u2	/home
u1	/pricing
u1	/home

Output (KTable view — click-count-out, compacted to latest).

user_id	count
u1	3
u2	1

Rule of thumb. If consumers want every event, model as KStream. If consumers want the latest value per key, model as KTable and emit to a compacted output topic. The compacted topic is a free read-side cache.

Worked example — co-partitioning a stream-stream join

Detailed explanation. A common bug: an interviewer asks you to join orders and payments by order_id, both written from different microservices. The orders topic has 12 partitions; the payments topic has 8. The naive orders.join(payments, ...) fails to compile (or worse, runs but produces wrong results in some versions) because co-partitioning is broken.

Question. Write a KStream-KStream join between orders and payments keyed on order_id, given that the two topics have different partition counts. Show the fix.

Input — orders (12 partitions).

order_id	amount
o1	50
o2	30

Input — payments (8 partitions).

order_id	paid
o1	50
o2	30

Code.

KStream<String, Order> orders     = builder.stream("orders");
KStream<String, Payment> payments = builder.stream("payments");

// BROKEN — co-partitioning fails (12 vs 8 partitions)
// orders.join(payments, ..., JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)))
//       .to("matched");

// Fix 1 — explicit repartition to align partition counts
KStream<String, Payment> paymentsRepart = payments
    .repartition(Repartitioned.<String, Payment>as("payments-repart")
                              .withNumberOfPartitions(12));

orders.join(paymentsRepart,
            (o, p) -> new Match(o.id, o.amount, p.paid),
            JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)))
      .to("matched");

Step-by-step explanation.

Kafka Streams checks co-partitioning at build time. If both sides come from topics with mismatched partition counts (or different partitioners), it throws a TopologyException instructing you to repartition.
payments.repartition(...) writes the stream to an internal Kafka topic with 12 partitions, keyed by the original order_id. The output KStream of that operator reads from the new internal topic.
Now both sides are co-partitioned: the same order_id lands in the same partition number on both streams, so the same stream task can join the two locally without network shuffles.
The join uses a JoinWindow — orders and payments within 5 minutes of each other match. The state is kept in a windowed RocksDB store (one per task) backed by a changelog.
The cost: one extra Kafka round-trip per payment record. If the topics will live forever, the cleaner fix is to make the producer write to 12 partitions in the first place so no internal repartition is ever needed.

Output.

order_id	amount	paid
o1	50	50
o2	30	30

Rule of thumb. Whenever you design a topic that will participate in joins, pick a partition count that matches every counterparty topic. The internal repartition topic is a real latency and storage cost; eliminate it at the schema design stage.

Worked example — GlobalKTable for a small dimension join

Detailed explanation. A clicks stream needs enrichment with a small geo lookup (~200 country codes). Co-partitioning a 200-row table to match the 100-partition clicks topic is wasteful. GlobalKTable replicates the whole table on every instance, so any partition can do the lookup locally.

Question. Enrich a clicks KStream with country_name from a small country_codes topic using a GlobalKTable. Show why this avoids the co-partitioning constraint.

Input — clicks (100 partitions).

user_id	country_code
u1	US
u2	DE

Input — country_codes (1 partition).

country_code	country_name
US	United States
DE	Germany

Code.

GlobalKTable<String, String> countries = builder.globalTable("country_codes");
KStream<String, Click> clicks = builder.stream("clicks");

clicks.join(countries,
            (clickKey, click) -> click.countryCode,  // key extractor on the stream side
            (click, countryName) -> click.withCountry(countryName))
      .to("clicks-enriched");

Step-by-step explanation.

builder.globalTable("country_codes") materialises the whole topic into a local RocksDB store on every instance of the app. All 200 country codes live on every node.
The join signature takes a key extractor (because the stream key is user_id but the join key is country_code). No selectKey is needed; no internal repartition topic is created.
Every stream task — regardless of its partition assignment — can look up any country code from its local copy of the global table. The lookup is a single RocksDB get.
The trade-off: every instance pays the memory cost of the full table. Worth it for 200 rows; catastrophic for 200M rows.

Output.

user_id	country_name
u1	United States
u2	Germany

Rule of thumb. Use GlobalKTable only when the lookup table fits in memory on every node (rule of thumb: < ~1 GB compacted). For larger lookups, use a regular KTable with co-partitioning, or hop out to Flink + async lookup join.

Senior interview question on Kafka Streams parallelism

A senior interviewer might ask: "Your Kafka Streams app processes 200K msg/sec from a 4-partition topic. You've added 10 instances of the app and CPU is still pegged on 4 of them. What is happening and how do you fix it?"

Solution Using partition-count expansion + standby replicas

// Diagnosis: 4 partitions = max 4 stream tasks = max 4 CPU cores doing work
// The other 6 instances are idle (or running standby replicas).

// Fix 1 — increase partition count on the source topic to 24
// (one-time, via kafka-topics --alter --partitions 24)

// Fix 2 — increase num.stream.threads per instance to use more cores per JVM
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 6);

// Fix 3 — keep 1 standby replica per task to shorten recovery on rebalance
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

Step-by-step trace.

Step	Before fix	After fix
Topic partitions	4	24
Max stream tasks	4	24
Active instances doing work	4 / 10	up to 24 / 10
Threads per instance	1	6
Standby replicas per task	0	1
Recovery on rebalance	minutes (replay full changelog)	seconds (warm RocksDB)

After the fix, the 10 instances run 24 tasks distributed roughly 2-3 per instance. With num.stream.threads=6 and 2-3 tasks per instance, all 6 threads have work; CPU is balanced.

Output:

Metric	Before	After
Throughput	200K msg/sec (capped)	1.2M msg/sec (linear scale)
Idle instances	6	0
Recovery time	~3 min	~10 sec
Internal topic count	0	0 (no repartition added)

Why this works — concept by concept:

Partition count is the throughput ceiling — Kafka Streams creates exactly one stream task per partition. You cannot run more tasks than partitions, no matter how many instances you add. Increasing partitions is the only way to grow the work pool.
num.stream.threads = cores per JVM — once the topic has enough partitions, each instance can run multiple tasks in parallel. Setting num.stream.threads ≈ CPU cores maxes out the JVM.
Standby replicas warm RocksDB — without standbys, a rebalance forces the new task owner to replay the whole changelog from offset 0. With num.standby.replicas=1, one other instance keeps a hot copy; rebalances take seconds.
No internal repartition needed — the fix changes partition count once on the source topic, not in the topology, so no internal repartition topic is created. Lower latency and storage cost.
Cost — increasing partitions is permanent and irreversible; pick a count that covers 3-5 years of growth. Per-record CPU cost is unchanged; recovery cost drops from O(changelog size) to O(delta since last standby tick).

Streaming
Topic — streaming
Kafka Streams parallelism problems

Practice →

Streaming Topic — streaming · python Streaming problems in Python

Practice →

3. Flink JobGraph + operator chain

`flink jobgraph` is what your DSL becomes — operators chained, parallelism per operator, slots shared, snapshots cut by barriers

The mental model in one line: a Flink program is a StreamGraph (your DSL), compiled to a JobGraph (chained operators), then deployed as an ExecutionGraph (parallel subtasks) onto TaskManager slots managed by a JobManager. Once you say "the JobGraph chains operators and each operator has its own parallelism," the entire apache flink vs kafka operator-parallelism interview surface becomes a sequence of consequences from that compilation pipeline.

The compilation pipeline.

StreamGraph — the raw user DSL: every operator is its own node, every edge is a network connection. Built client-side from env.fromSource(...)... .keyBy(...)... .sinkTo(...).
JobGraph — the optimised graph: adjacent operators with the same parallelism and forward partitioning are chained into a single task. Chaining eliminates the serialisation between operators.
ExecutionGraph — the parallel deployment: each operator's parallelism setting expands into N subtasks; the JobManager assigns subtasks to TaskManager slots.

The cluster topology.

JobManager — one master process per job (HA mode: standby JobManagers via ZooKeeper / K8s). Owns the checkpoint coordinator, the scheduler, and the job's metadata.
TaskManager — worker processes, one or more. Each TaskManager runs a fixed number of slots (configurable, typically = CPU cores). Each slot can host multiple operator subtasks via slot sharing.
Slot sharing groups — by default, every operator in a job belongs to the same slot-sharing group, which lets multiple operators co-locate in the same slot (less network shuffle, more efficient).

Parallelism — per-operator, not per-job.

Every operator has its own setParallelism(N). Without an explicit setting, Flink uses the job's default parallelism.
A keyBy reshuffles records by hash so that all records with the same key go to the same downstream subtask. Keyed state is partitioned the same way.
Mismatched parallelism between adjacent operators triggers a rebalance / hash / rescale partitioning between them.

Operator state vs keyed state.

Operator state (broadcast, list, union) — per-subtask state independent of key. Useful for connector offsets, broadcast lookups.
Keyed state (ValueState, ListState, MapState, ReducingState) — per-key state inside a KeyedStream. The runtime guarantees that all records with the same key go to the same subtask, so keyed state access is local.
State backend — HashMapStateBackend (heap, JVM-bounded) or EmbeddedRocksDBStateBackend (off-heap, disk-bounded). Snapshotted on checkpoint.

Checkpoints vs savepoints — the operational split.

Checkpoints — automatic, periodic, fast restore on failure. Internal format; tied to a job execution. Configured via env.enableCheckpointing(interval).
Savepoints — user-triggered, portable, versioned snapshots used for planned restarts, upgrades, or topology migrations. Survive cross-Flink-version restores.
Both are produced by the same Chandy-Lamport barrier mechanism, but savepoints get extra metadata and are intentionally kept user-owned.

Backpressure — credit-based flow control.

Each consumer subtask grants the upstream producer a credit for every free buffer slot. The producer cannot send beyond granted credits.
When a downstream operator is slow, its credit pool dries up; the upstream operator stops sending and slows down its own input. The pressure propagates back to the source.
Flink's web UI exposes backpressure per operator — "HIGH" backpressure on an operator means that operator is the bottleneck.

Common interview probes on the JobGraph.

"What is the difference between StreamGraph and JobGraph?" — StreamGraph is the raw DSL; JobGraph is the optimised chained graph.
"What is operator chaining?" — adjacent operators with same parallelism and forward partitioning run in the same task without serialisation between them.
"What is the difference between a checkpoint and a savepoint?" — checkpoints are automatic, internal, and tied to a running job; savepoints are user-triggered and portable across Flink versions.
"What is slot sharing?" — multiple operator subtasks share a single TaskManager slot to reduce network shuffle and pack more parallelism per machine.

Worked example — same DSL, different parallelism per operator

Detailed explanation. A Flink job reads from a 4-partition Kafka topic, runs a heavy enrichment step, then writes to a 4-partition sink. The enrichment is CPU-bound; the team wants it at 16-way parallelism while keeping the source and sink at 4. Operator chaining and parallelism settings make this trivial in Flink.

Question. Write a Flink job with source parallelism 4, enrichment parallelism 16, and sink parallelism 4. Show the JobGraph nodes and which subtasks live in which TaskManager slot.

Input.

Operator	Parallelism	Notes
Kafka source	4	one subtask per partition
Map (filter)	4	chained with source
Heavy enrichment	16	CPU-bound, needs scale
Kafka sink	4	matches output partition count

Code.

DataStreamSource<Event> src = env.fromSource(
        KafkaSource.<Event>builder().setTopics("events").build(),
        WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)),
        "events-src")
    .setParallelism(4);

DataStream<Event> filtered = src
    .filter(e -> e.kind != null)
    .setParallelism(4);   // chains with source

DataStream<Enriched> enriched = filtered
    .keyBy(e -> e.userId)                   // hash partition by userId
    .process(new HeavyEnrichmentFn())
    .setParallelism(16);                    // CPU-bound, scale out

enriched.sinkTo(KafkaSink.<Enriched>builder()
                          .setTransactionalIdPrefix("enriched-")
                          .build())
        .setParallelism(4);                 // matches sink partition count

Step-by-step explanation.

env.fromSource(...).setParallelism(4) creates 4 source subtasks, one per Kafka partition. Each subtask reads from its assigned partition and emits records downstream.
.filter(...).setParallelism(4) runs at the same parallelism as the source with forward partitioning, so Flink chains the source and filter into a single task: no serialisation between them, both run in the same TaskManager slot.
.keyBy(...) is a hash partitioning operation. It breaks the chain — the next operator has parallelism 16 (different from upstream 4), so Flink inserts a network shuffle that hash-partitions records by userId across 16 downstream subtasks.
.process(new HeavyEnrichmentFn()).setParallelism(16) runs the CPU-bound function at 16-way parallelism. Each subtask handles ~1/16 of the keys.
.sinkTo(...).setParallelism(4) runs the sink at 4. Another shuffle from 16 → 4 (rebalance, round-robin by default).
Slot sharing means the 4 source-filter subtasks, the 16 enrichment subtasks, and the 4 sink subtasks can co-locate in slots if there is room. The JobManager packs them efficiently across the available TaskManager slots.

Output (JobGraph shape).

Node	Parallelism	Chained with	Shuffle in
Source + Filter	4	(chained)	none
Heavy Enrichment	16	(own task)	hash by userId
Sink	4	(own task)	rebalance

Rule of thumb. Use setParallelism per operator on any DAG where compute cost varies by step. Use the same parallelism for chained sequential operators (source → filter → map) to keep them in one task with zero serialisation overhead.

Worked example — checkpoint barriers in flight

Detailed explanation. Flink's checkpoint mechanism is the Chandy-Lamport algorithm adapted to streaming. The JobManager periodically injects a barrier into every source subtask. Each operator forwards the barrier to its outputs only after persisting its own state for that barrier. When the sinks see the barrier, the checkpoint is complete.

Question. Trace a single checkpoint through a 3-operator pipeline (Source → KeyBy + Aggregate → Sink) at parallelism 2. Show what each subtask does at the moment a barrier passes through it.

Input.

Time	Event
t=0	barrier-7 injected by JobManager to both source subtasks

Code.

env.enableCheckpointing(60_000);                              // every 60s
env.getCheckpointConfig().setCheckpointStorage("s3://flink/cp");
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5_000);
env.getCheckpointConfig().setCheckpointTimeout(120_000);
env.getCheckpointConfig().enableUnalignedCheckpoints();

// Setting it once on the env applies to every operator.

Step-by-step explanation.

JobManager's CheckpointCoordinator generates checkpoint id 7 and sends a "trigger" RPC to every source subtask.
Each source subtask records its read offset (e.g. Kafka offset 12345 for partition 0, 12678 for partition 1), persists the offset to the state backend, then emits a checkpoint barrier (a special record containing checkpoint id 7) into the output stream.
The KeyBy + Aggregate operator receives the barrier from one of its input channels. It buffers further records on that channel until the barrier arrives on all input channels (this is "barrier alignment" in aligned mode).
Once aligned, the operator atomically snapshots its keyed state (the per-key running aggregates) to the state backend, then forwards the barrier downstream.
With unaligned checkpoints, step 3 changes: the operator does not wait for barrier alignment; it captures in-flight records on the slower channels into a side buffer and snapshots immediately. Trades extra state size for lower checkpoint latency under backpressure.
The sink receives the barrier, completes any pending transactional commit, and acknowledges to the JobManager. Once all sinks acknowledge, checkpoint 7 is complete and is the new restore point.
If any subtask fails before completion, the JobManager restarts the job from checkpoint 6 (or whichever was the last completed one).

Output (per-subtask state at barrier-7 completion).

Subtask	Persisted state
Source-0	Kafka offset 12345
Source-1	Kafka offset 12678
Aggregate-0	RocksDB snapshot for keys hash mod 2 == 0
Aggregate-1	RocksDB snapshot for keys hash mod 2 == 1
Sink-0	last-committed Kafka producer transaction id
Sink-1	last-committed Kafka producer transaction id

Rule of thumb. Aligned checkpoints are correct but slow under backpressure. Enable unaligned checkpoints on any job with bursty load or skewed keys — the extra state size is small compared to the checkpoint-timeout you would otherwise hit.

Worked example — savepoint upgrade with topology change

Detailed explanation. A live Flink job needs a new operator added between source and the existing aggregate. You cannot just redeploy — the new job's operator IDs differ, so checkpoint restore fails. The pattern is: take a savepoint, stop the old job, add uid() tags to every stateful operator, redeploy from the savepoint with the new topology.

Question. A Flink job runs in production. The team wants to insert a new filter operator before the aggregate. Show the savepoint + restart sequence and the uid() tagging discipline that makes it work.

Input.

Old DAG	New DAG
Source → Aggregate → Sink	Source → Filter → Aggregate → Sink

Code.

# 1) Take a savepoint of the running job
flink savepoint <jobId> s3://flink/savepoints/my-job

// 2) Redeploy with a NEW operator + explicit uid() tags
DataStreamSource<Event> src = env.fromSource(kafkaSource, ws, "src")
                                  .uid("src-1")            // STABLE id across restarts
                                  .setParallelism(4);

DataStream<Event> filtered = src
    .filter(e -> e.kind != null)
    .uid("filter-1")                                       // NEW operator — fresh id
    .setParallelism(4);

DataStream<Agg> aggregated = filtered
    .keyBy(e -> e.userId)
    .process(new AggregateFn())
    .uid("agg-1")                                          // SAME id as before
    .setParallelism(16);

aggregated.sinkTo(kafkaSink)
          .uid("sink-1")                                   // SAME id as before
          .setParallelism(4);

# 3) Start from the savepoint
flink run -s s3://flink/savepoints/my-job/savepoint-... -allowNonRestoredState my-job.jar

Step-by-step explanation.

The savepoint captures every operator's state by operator uid. Without an explicit uid(), Flink generates one from the operator's position in the DAG — any topology change rotates the uid and breaks restore.
Tagging every stateful operator with a stable uid() makes the topology change safe: the runtime maps savepoint state back to the matching uid even if the DAG order changed.
The new filter-1 operator has a fresh uid; it has no prior state to restore. Flink's -allowNonRestoredState flag prevents the restore from erroring on the new operator.
The Aggregate operator's uid is unchanged → its RocksDB state from the old job is restored into the same operator in the new job.
Once running, the new job resumes processing from the savepoint's offsets with the new filter step inserted.

Output (state mapping during restore).

Savepoint operator	New job operator	Mapped?
src-1	src-1	yes
—	filter-1 (new)	no (fresh state)
agg-1	agg-1	yes
sink-1	sink-1	yes

Rule of thumb. Tag every stateful operator with an explicit .uid(...) from day one. The first time you need a topology upgrade, you will be glad you did. Without it, you take a downtime for state reset.

Senior interview question on JobGraph compilation

A senior interviewer might ask: "Explain what happens to a Flink program between env.execute() and the moment a record is actually processed on a TaskManager. What graphs get built, what optimisations happen, and what is the role of each component?"

Solution Using the StreamGraph → JobGraph → ExecutionGraph pipeline

Client side
-----------
1. User builds DataStream API → StreamGraph
2. StreamGraph optimised (chain compatible operators) → JobGraph
3. JobGraph + JAR submitted via REST to JobManager

JobManager side
---------------
4. JobManager receives JobGraph
5. Each JobVertex expanded by parallelism → ExecutionGraph (parallel subtasks)
6. Scheduler requests slots from ResourceManager (one slot per subtask, modulo slot sharing)
7. Subtasks deployed to TaskManager slots; checkpoint coordinator initialised

TaskManager side
----------------
8. TaskManagers receive the TaskDeploymentDescriptor
9. Each slot loads the user code, sets up the state backend, opens network channels
10. Source subtasks begin reading; records flow through the operator chain

Step-by-step trace.

Stage	Where	Output
1	Client JVM	StreamGraph (user-DSL DAG)
2	Client JVM	JobGraph (chained operators)
3	Client → REST	JAR + JobGraph submitted
4	JobManager	JobGraph parsed
5	JobManager	ExecutionGraph (subtasks per operator)
6	JobManager + RM	Slot offers granted
7	JobManager	Subtasks deployed
8	TaskManager	TDD received
9	TaskManager	Operator chains initialised
10	TaskManager	Records flow

The pipeline distinguishes three graphs because each stage has its own purpose: StreamGraph is the user model, JobGraph is the optimised deployment unit, ExecutionGraph is the runtime parallel form.

Output:

Component	Owns	Talks to
Client	StreamGraph → JobGraph	JobManager (REST)
JobManager	JobGraph + ExecutionGraph + CheckpointCoordinator	TaskManagers (RPC)
ResourceManager	Slot pool	JobManager + TaskManagers
TaskManager	Slots + user code + state backend	JobManager (heartbeat + checkpoints)

Why this works — concept by concept:

Three graphs, three concerns — StreamGraph is what you wrote, JobGraph is what gets deployed, ExecutionGraph is what actually runs. Each stage is a refinement; you do not jump straight from DSL to subtasks.
Operator chaining at JobGraph — chain-compatible operators (same parallelism, forward partitioning) fuse into a single task. Eliminates serialisation between them. Caveat: chaining is broken by keyBy, rebalance, broadcast.
Slot sharing at ExecutionGraph — by default, every operator in a job shares slots. Lets the scheduler pack source + map + sink subtasks for the same parallelism slice into the same slot, reducing total slot demand.
Checkpoint coordinator lives in JobManager — barriers originate from the JM and acknowledgements return to the JM. The JM is the single point of consistency for the job.
Cost — compilation and submission are O(operators); runtime is O(events). The bookkeeping is constant per checkpoint, so checkpoint overhead is bounded.

Streaming
Topic — streaming
Flink JobGraph problems

Practice →

Streaming Topic — streaming · python · medium Medium Python streaming drills

Practice →

4. State backends + exactly-once

`flink checkpointing` snapshots state to remote storage; Kafka Streams replicates state to a changelog topic — different mechanisms, same exactly-once guarantee

The mental model in one line: both engines keep keyed state in RocksDB; Kafka Streams replicates every state change to a Kafka changelog topic for durability, while Flink snapshots state periodically to remote storage via Chandy-Lamport barriers — and both engines achieve end-to-end exactly-once by pairing this state durability with a transactional sink. Once you say "durability mechanism differs, EOS semantics align," the exactly-once streaming interview surface becomes a deduction from those two halves.

Kafka Streams state durability — changelog replay.

Every write to a state store goes to a co-located changelog topic in Kafka with name <application.id>-<store-name>-changelog.
The changelog topic is compacted by default — exactly one record per key, so it converges to the size of the state store, not the lifetime stream.
On task move (rebalance or restart), the new owner reads the changelog topic from offset 0 to rebuild the RocksDB store.
Standby replicas keep warm copies on other instances. With num.standby.replicas=1, recovery is O(delta since last warm-up tick) instead of O(full changelog).

Flink state durability — distributed snapshot.

The state backend is either HashMapStateBackend (heap-only, JVM-bounded) or EmbeddedRocksDBStateBackend (off-heap + disk, multi-TB capable).
On every checkpoint, the state is snapshotted to a configured CheckpointStorage (file URI: s3://, hdfs://, gs://).
The snapshot is incremental in RocksDB mode — only the SST files that changed since the last checkpoint are uploaded. Restoration loads the chain of incremental snapshots.
The snapshot is asynchronous by default — the operator hot path keeps processing while the state is copied to remote storage in the background.

The EOS mode ladder.

At-least-once — the default for both engines if you don't enable EOS. Records are processed and committed; on failure, the source replays from the last committed offset, so some records can be reprocessed.
Idempotent — duplicates are produced but the sink dedupes (e.g. unique constraint on output, upsert by primary key). End-state-exactly-once even if events are not.
Transactional / EOS-v2 — sources rewind to the last committed checkpoint, operators replay deterministically, and the sink commits atomically with the checkpoint. No duplicates ever leave the job.

Kafka Streams EOS-v2.

Set processing.guarantee=exactly_once_v2 in StreamsConfig.
The runtime uses a single transactional producer per stream thread (instead of per task in EOS-v1) — dramatically lower transaction overhead.
Each commit ticks a Kafka transaction across input offsets + state changelog writes + output topic writes — all-or-nothing.
Requires Kafka 2.5+ on the broker side and min.in.sync.replicas ≥ 2 for the transactional __transaction_state topic.

Flink TwoPhaseCommitSinkFunction.

An abstract class with three methods: beginTransaction(), preCommit(txn), commit(txn).
On every checkpoint barrier, the sink calls preCommit — buffer flushed but not yet visible.
On checkpoint completion, the JobManager notifies all sinks; each sink calls commit to publish the buffered batch atomically.
Concrete implementations: KafkaSink (Kafka transactional producer), FileSink (temp file → atomic rename), JdbcSink (transaction), Iceberg sink (commit a new snapshot).

The three EOS failure modes interviewers ask about.

Source replay without idempotence on the sink. Sources rewind on failure; the sink sees duplicates. Caught by: "what is your sink's idempotence story?"
State not in the snapshot. A new operator (or a missing uid()) means state was not restored; counts double on restart. Caught by: "do you tag every stateful operator with uid()?"
Sink commit not atomic with the checkpoint. Sink writes commit before the source acks → duplicates. Caught by: "does your sink implement TwoPhaseCommit?"

Worked example — exactly-once in Kafka Streams

Detailed explanation. A Kafka Streams app reads from orders, computes a running per-user total, writes to totals. The team wants exactly-once: no duplicate totals, no missed orders, even across crashes.

Question. Configure a Kafka Streams app for EOS-v2 and explain what the runtime now does differently. Show the config and the producer-side transaction shape.

Input.

Topic	Role
orders	source
totals	sink
-running-totals-changelog	internal state durability

Code.

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "totals-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once_v2");
props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3);          // for internal topics

StreamsBuilder builder = new StreamsBuilder();
builder.<String, Order>stream("orders")
       .groupBy((k, v) -> v.userId)
       .aggregate(() -> 0L,
                  (k, v, agg) -> agg + v.amount,
                  Materialized.<String, Long>as("running-totals"))
       .toStream()
       .to("totals", Produced.with(Serdes.String(), Serdes.Long()));

KafkaStreams app = new KafkaStreams(builder.build(), props);
app.start();

Step-by-step explanation.

With exactly_once_v2, every stream thread holds a single transactional producer. The transactional id is derived deterministically from application.id + stream-thread id.
On each commit.interval.ms tick (default 100ms with EOS), the runtime opens a transaction. Within the transaction it: reads input offsets, writes state-changelog records, writes output records, advances the consumer's committed offsets.
On commit, all four writes are atomic. Either all are visible or none — no duplicates, no lost updates.
On crash, the new owner of the task fences the old transactional producer (Kafka's transactional fencing protocol) and starts a fresh transaction from the last committed offset. The state store rebuilds from the changelog up to that offset.
The price of EOS-v2 is higher latency at commit boundaries (transactions add a round-trip) and the requirement that the Kafka brokers support transactions (isr=2+, min.in.sync.replicas=2).

Output (per-transaction atomic write set).

Write	Visible if commit succeeds
Updated state-changelog records	yes
Output records to `totals`	yes
Advanced consumer offsets on `orders`	yes
All-or-nothing on failure	yes

Rule of thumb. Use EOS-v2 by default on any Kafka-only pipeline. The latency overhead (~10ms at commit) is negligible compared to the engineering cost of a custom dedup pipeline.

Worked example — TwoPhaseCommitSinkFunction in Flink

Detailed explanation. A Flink job reads from Kafka, computes hourly aggregates, and writes them to Iceberg. The team wants end-to-end exactly-once. Flink's KafkaSource is rewind-safe on its own; the Iceberg sink must implement TwoPhaseCommit so that a checkpoint and the Iceberg commit happen atomically.

Question. Show how Flink's TwoPhaseCommit sink integrates with checkpointing. Walk through the four phases for a single checkpoint.

Input.

Stage	Action
Checkpoint barrier injected	t=0
Sink preCommit	t=1
Checkpoint complete	t=2
Sink commit (notifyCheckpointComplete)	t=3

Code.

public class IcebergTwoPhaseSink extends TwoPhaseCommitSinkFunction<Row, TxnHandle, Void> {
    @Override
    protected TxnHandle beginTransaction() {
        // open a new Iceberg writer; returns a handle
        return new TxnHandle(icebergTable.newAppend());
    }

    @Override
    protected void invoke(TxnHandle txn, Row row, Context ctx) {
        txn.append(row);                       // buffered to local files
    }

    @Override
    protected void preCommit(TxnHandle txn) {
        txn.flushFilesToObjectStore();         // visible to txn only
    }

    @Override
    protected void commit(TxnHandle txn) {
        txn.appendOp.commit();                 // atomic Iceberg metadata flip
    }

    @Override
    protected void abort(TxnHandle txn) {
        txn.deleteUncommittedFiles();
    }
}

Step-by-step explanation.

The sink begins a transaction the moment a checkpoint barrier triggers (after the previous commit). All records between barriers go into this transaction.
On the barrier, preCommit is called: data files are flushed to object storage but not yet visible to the table (no metadata update). The transaction is durable but pending.
The checkpoint metadata (transaction id, snapshot pointers) is included in the checkpoint state and persisted to remote storage.
Once the JobManager confirms checkpoint completion, every sink instance receives notifyCheckpointComplete → calls commit → flips the Iceberg metadata pointer atomically. The new data is now visible to all readers.
If the job crashes between preCommit and commit, recovery re-reads the last completed checkpoint and finds the pending transaction. It either retries the commit (idempotent in Iceberg) or aborts and re-runs from the prior checkpoint.

Output (Iceberg snapshot visibility timeline).

Time	Sink state	Iceberg snapshot
t=0	begin txn N	snapshot N-1
t=0..1	records buffered	snapshot N-1
t=1	preCommit (files flushed)	snapshot N-1
t=2	checkpoint complete	snapshot N-1
t=3	commit (metadata flip)	snapshot N

Rule of thumb. Implement TwoPhaseCommit on any sink that supports atomic publish (transactional Kafka, Iceberg, JDBC, S3 with rename). For sinks that do not, fall back to idempotent writes with a dedup column.

Worked example — RocksDB tuning for large state

Detailed explanation. Both engines use RocksDB under the hood. For multi-TB state, RocksDB tuning becomes the dominant operational concern: block cache size, write buffer, compaction triggers. The defaults work for low-GB state; large state needs hand-tuning.

Question. A Flink job has a 500GB keyed state. Default RocksDB settings cause high latency on writes and slow recovery. Show the key config knobs and how each one affects performance.

Input.

State size	Default RocksDB issue
500 GB / TaskManager	small block cache → cache miss on every read

Code.

# flink-conf.yaml — large-state RocksDB tuning
state.backend: rocksdb
state.backend.incremental: true
state.backend.rocksdb.localdir: /mnt/nvme/rocksdb            # fast SSD

# Memory budget per slot (off-heap)
state.backend.rocksdb.memory.managed: true
state.backend.rocksdb.memory.fixed-per-slot: 4 GB

# Write buffer + L0 trigger
state.backend.rocksdb.writebuffer.size: 256 MB
state.backend.rocksdb.writebuffer.count: 4

# Block cache + bloom filters
state.backend.rocksdb.block.cache-size: 2 GB
state.backend.rocksdb.block.bloomfilter: true

# Compaction
state.backend.rocksdb.compaction.level.use-dynamic-size: true
state.backend.rocksdb.thread.num: 4

Step-by-step explanation.

state.backend.rocksdb.localdir on a fast NVMe drive avoids the EBS/EFS latency penalty. RocksDB is I/O-heavy; SSD is non-negotiable.
memory.managed=true lets Flink budget RocksDB memory off-heap, preventing OOM on the JVM heap. fixed-per-slot partitions it predictably.
writebuffer.size + writebuffer.count controls L0 flush frequency. Larger buffers reduce write amplification at the cost of higher memory.
block.cache-size caches the most-read SST blocks. With a 2 GB cache hitting 80%+ of reads, latency drops 10x compared to the default 8 MB cache.
bloomfilter skips disk reads for keys that definitely do not exist in an SST. Especially useful for point-lookup keyed state.
compaction.level.use-dynamic-size + thread.num lets compaction scale with write volume; without it, L0 backs up under bursty writes and stalls foreground writes.

Output.

Tuning	Default	Tuned	Effect
block.cache-size	8 MB	2 GB	10x read latency drop
writebuffer.size	64 MB	256 MB	4x lower write amplification
localdir	EBS gp2	NVMe	5x I/O latency drop
memory.managed	heap	off-heap	no OOM on heavy state

Rule of thumb. Provision NVMe local disks for any Flink job with >100 GB state. Use the managed-memory off-heap config; never let RocksDB share JVM heap with operator code. Recover times scale linearly with state size — keep checkpoints incremental and small.

Senior interview question on end-to-end exactly-once

A senior interviewer might ask: "Walk me through, end-to-end, how exactly-once works in Flink. What does the source do, what does the operator do, what does the sink do, and what is the JobManager's role? Where can it break?"

Solution Using barrier-aligned snapshot + TwoPhaseCommit sink

End-to-end exactly-once in Flink
================================

Source side
-----------
1. Kafka source records its read offsets as part of operator state.
2. On checkpoint, the offset becomes part of the snapshot.
3. On restore, source rewinds to the snapshot's offset.

Operator side
-------------
4. Stateful operator's keyed state lives in RocksDB.
5. On barrier, operator buffers any extra input until barrier alignment
   (or with unaligned checkpoints, captures in-flight records into the snapshot).
6. Operator snapshots its state to remote storage (S3/HDFS) asynchronously.

Sink side (TwoPhaseCommit)
--------------------------
7. On barrier, sink's preCommit() flushes buffered output to durable but
   not-yet-visible storage (e.g. Kafka transactional producer pre-commit,
   Iceberg pending snapshot).
8. JobManager waits for every operator + sink to ack the barrier.
9. JobManager declares the checkpoint complete; notifies every sink.
10. Each sink's commit() makes the buffered output visible atomically.

Failure recovery
----------------
11. On crash, JobManager restarts from last COMPLETED checkpoint.
12. Source rewinds, operators restore state, sinks abort any pending txn
    that did not complete, and processing resumes deterministically.

Step-by-step trace.

Phase	Source	Operator	Sink
pre-checkpoint	reading offset 1000	processing records	buffering txn-N
barrier injected	snapshot offset 1000	aligned	preCommit txn-N
barrier propagated	—	snapshot state	preCommit done
checkpoint complete	acked	acked	commit txn-N
visible to readers	—	—	yes
crash & restore	rewind to 1000	restore state	abort txn-N+1

The trace shows that all three components (source, operator, sink) must participate in the barrier protocol for end-to-end EOS. If any one falls back to at-least-once (e.g. sink without TwoPhaseCommit), the whole job degrades to at-least-once.

Output:

Component	EOS contribution
Source	offset snapshotted with checkpoint; rewinds on restore
Operator	state snapshotted asynchronously, barrier-aligned consistency
Sink	preCommit + commit tied to checkpoint completion
JobManager	barrier coordinator, declares checkpoint complete, notifies sinks

Why this works — concept by concept:

Barrier-aligned snapshot — the Chandy-Lamport algorithm produces a globally consistent snapshot of distributed state without stopping the world. Each operator snapshots independently once it sees barriers on all input channels.
Source rewind on restore — Flink does not need exactly-once delivery from the source. It needs replayable sources. Kafka offsets, file paths, JDBC cursors are all replayable; the EOS guarantee is built on top.
TwoPhaseCommit on the sink — splits "buffer and persist" from "make visible." The visibility flip is tied to the checkpoint completion, so source rewind + sink rollback are atomic across the pipeline.
JobManager as coordinator — the JM is the single decision point. Without a centralised coordinator, distributed two-phase commit has no leader; with the JM, it is a textbook 2PC.
Cost — checkpoint overhead is O(state) for the snapshot upload (mitigated by incremental + async). Sink overhead is O(records) for the buffering. EOS-v2 latency overhead is ~10-50 ms at commit boundaries — small for batch sinks (Iceberg), more noticeable for low-latency sinks (Kafka).

Streaming
Topic — streaming
Exactly-once streaming problems

Practice →

Streaming Topic — real-time analytics Real-time analytics problems

Practice →

5. Picking the engine — co-located vs cluster

Senior engineers pick `kafka streams vs flink` from a 5-question decision tree — never from "which is faster"

The mental model in one line: the engine choice is a deployment-shape decision (library vs cluster), driven by 5 questions about source, parallelism, exactly-once surface, latency, and ops capacity — not by raw throughput numbers. Once you internalise the decision tree, the flink vs spark and apache flink vs kafka interview surface collapses to "what does your pipeline actually need?"

The 5 questions in order.

Q1 — Source. Kafka only? → Kafka Streams eligible. Multi-source (Kafka + Kinesis + JDBC CDC + S3 files)? → Flink (much richer connector library).
Q2 — Exactly-once surface. Kafka in, Kafka out, transactional? → Kafka Streams EOS-v2 is enough. EOS across non-Kafka sinks (Iceberg, JDBC, S3 files)? → Flink TwoPhaseCommitSinkFunction.
Q3 — Parallelism shape. Source rate ≈ compute rate, partition count is enough? → Kafka Streams. Per-operator parallelism differs by 4x or more? → Flink.
Q4 — Latency budget. Sub-50ms p99 with no network hop? → Kafka Streams embedded. Sub-second is fine? → either, with Flink also viable. 1s+? → Spark Structured Streaming becomes a contender.
Q5 — Ops capacity. No platform team, just microservices? → Kafka Streams. Platform team that can operate a Flink JobManager + autoscaling TaskManagers? → Flink.

Scale envelope cheat sheet.

Kafka Streams — sweet spot. ≤ 64 partitions, ≤ 1M events/sec, state ≤ ~50 GB per node, EOS-v2 to Kafka. Above that, you hit partition-count walls, changelog throughput limits, or single-node state caps.
Flink — sweet spot. 100s–1000s of operator subtasks, multi-TB state (RocksDB + remote snapshot), multi-source, EOS to non-Kafka sinks. The cluster overhead is amortised once you cross ~5 jobs.
Spark Structured Streaming — when it wins. Already-running Spark batch team, micro-batch latency (100ms–1s) is fine, unified batch + streaming code is a win. Loses on sub-second latency.

Ops cost — the hidden line item.

Kafka Streams ops. No cluster. State backups via Kafka changelog. Monitoring via JMX from your microservice. Roughly zero net-new operational surface beyond what your microservices already need.
Flink ops. JobManager HA (ZooKeeper or K8s leader election). TaskManager autoscaling (reactive mode or external operator). Savepoint storage (S3 lifecycle, retention). Web UI + REST API exposure. Checkpoint storage cost (incremental snapshots × N jobs × retention).
Spark ops. YARN / Kubernetes cluster, driver HA, RDD checkpoint storage, streaming-vs-batch resource contention if you share a cluster.

Hiring pool — the other hidden line item.

Kafka Streams developers are Java/Kotlin/Scala microservice engineers. Larger pool, lower cost.
Flink developers are streaming-specialist; market is small and expensive.
Spark developers are abundant on the batch side, sparse on the streaming side. Mixed-skill teams are common.

Common interview probes on engine selection.

"When is Kafka Streams not an option?" — multi-source pipelines, EOS to non-Kafka sinks, per-operator parallelism > partition count, state > what fits on a single node.
"When is Flink overkill?" — Kafka-only pipelines under 1M events/sec where the existing microservice deployment is fine.
"Flink vs Spark Structured Streaming?" — Flink for sub-second + true streaming + per-operator parallelism; Spark when you already have Spark batch and 100ms–1s latency is fine.
"Why not Pulsar Functions / Beam / others?" — usually because Kafka Streams and Flink have the deepest production track record and the largest community.

Worked example — Q1 source check forces the choice

Detailed explanation. A team building a real-time fraud system needs to combine a Kafka stream of transactions with a CDC stream from a Postgres users table arriving via Debezium and a file drop on S3 of merchant categories. Kafka Streams cannot connect to S3 files; Flink can.

Question. Given the source list, walk through Q1 and show why Flink is the only viable choice.

Input.

Source	Protocol
transactions	Kafka topic
users CDC	Debezium → Kafka topic
merchant_categories	S3 files (daily drop)

Code.

// Flink — three different source connectors, all in one job
DataStream<Txn> txns = env.fromSource(KafkaSource.builder()..., ws, "txns");
DataStream<User> users = env.fromSource(KafkaSource.builder()..., ws, "users-cdc");
DataStream<Category> cats = env.fromSource(
        FileSource.forRecordStreamFormat(new CsvFormat(), new Path("s3://merchants/")).build(),
        WatermarkStrategy.noWatermarks(),
        "merchants");

txns.connect(users.broadcast(userDescriptor))
    .process(new EnrichWithUser())
    .keyBy(t -> t.merchantId)
    .connect(cats.broadcast(catDescriptor))
    .process(new EnrichWithCategory())
    .sinkTo(...);

Step-by-step explanation.

Kafka Streams supports exactly one source connector: Kafka. The S3 file drop is invisible to it. You would need a separate Kafka Connect pipeline to land S3 into Kafka first, adding a hop and latency.
Flink's FileSource handles the S3 drop natively. The job is one deployment, three sources, no extra hops.
The two broadcast streams (users + categories) replicate the small lookup data to every subtask; the transactions stream is keyed and processed per merchant.
The same job would require a complete rewrite to land back on Kafka Streams; the source diversity is a hard constraint.

Output.

Source	Kafka Streams native?	Flink native?
Kafka topic	yes	yes
Debezium CDC topic	yes (it's still Kafka)	yes
S3 files	no	yes
JDBC CDC direct	no	yes
Kinesis	no	yes
Pulsar	no	yes
RabbitMQ	no	yes

Rule of thumb. Q1 alone often eliminates Kafka Streams. List every source the pipeline will touch in year 1; if any of them is not Kafka-protocol, Flink is the answer.

Worked example — Q2 EOS surface forces Flink for Iceberg sink

Detailed explanation. Your pipeline reads Kafka, computes aggregates, writes to an Iceberg table. The team needs exactly-once on the Iceberg side: no duplicate aggregate rows, no missed events even across crashes. Kafka Streams cannot do EOS to Iceberg; Flink can via the Iceberg connector's TwoPhaseCommit.

Question. Given a Kafka → aggregate → Iceberg pipeline, show why Q2 (EOS surface) pushes you to Flink.

Input.

Pipeline	EOS requirement
Kafka → aggregate → Iceberg	end-to-end exactly-once

Code.

// Flink — Iceberg sink with TwoPhaseCommit
DataStream<Agg> aggs = env.fromSource(kafkaSource, ws, "src")
                          .keyBy(e -> e.userId)
                          .window(TumblingEventTimeWindows.of(Time.hours(1)))
                          .aggregate(new HourlyAgg());

FlinkSink.forRowData(aggs.map(MyConverter::toRowData))
         .table(icebergTable)
         .tableSchema(schema)
         .writeParallelism(8)
         .upsert(false)
         .append();   // TwoPhaseCommit under the hood

env.enableCheckpointing(60_000);
env.execute("hourly-aggs-to-iceberg");

Step-by-step explanation.

The Iceberg Flink connector implements TwoPhaseCommitSinkFunction. On every checkpoint barrier, it pre-commits data files to S3 (not yet visible to Iceberg readers).
When the checkpoint completes (JobManager ack), the sink commits the Iceberg metadata pointer atomically. All buffered rows become visible at once.
On crash, recovery rewinds the Kafka source to the last completed checkpoint and re-runs. The pre-committed but not-finalised data files are detected and either retried or aborted.
Kafka Streams cannot do this — its EOS-v2 contract is strictly Kafka-to-Kafka. To write to Iceberg, it would need a sidecar pipeline (Kafka → Iceberg via Kafka Connect), and the EOS chain would break at the connector.

Output.

Sink	Kafka Streams EOS?	Flink EOS?
Kafka topic	yes (EOS-v2)	yes (KafkaSink)
Iceberg table	no	yes (FlinkSink)
JDBC	no	yes (JdbcSink TPC)
S3 files	no	yes (FileSink TPC)
Pulsar	no	yes
ClickHouse	no	yes (community connector)

Rule of thumb. If your pipeline writes to anything other than Kafka and you need exactly-once, Flink is the answer. There is no second option.

Worked example — Q5 ops capacity forces Kafka Streams

Detailed explanation. A startup has 3 backend engineers, no platform team. They need to compute a per-user real-time score from a Kafka events topic and write back to a Kafka scores topic. Flink would work, but they cannot operate a JobManager + autoscaling TaskManagers + savepoint storage + monitoring on top of their existing microservice stack. Kafka Streams piggybacks on what they already run.

Question. Given the ops constraint, why does Kafka Streams win Q5? Show the deployment shape.

Input.

Constraint	Value
Backend engineers	3
Platform team	0
Existing infra	Spring Boot + Kubernetes
Pipeline	Kafka events → score → Kafka scores

Code.

# Just a regular Spring Boot deployment — adds zero net-new ops surface
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scoring-app
spec:
  replicas: 6
  template:
    spec:
      containers:
        - name: scoring
          image: registry/scoring-app:1.4.2
          env:
            - name: KAFKA_BOOTSTRAP
              value: kafka:9092
            - name: APPLICATION_ID
              value: scoring-app
          resources:
            requests:
              cpu: 1
              memory: 2Gi
            limits:
              cpu: 2
              memory: 4Gi

Step-by-step explanation.

The Kafka Streams app is a regular Spring Boot jar deployed as a Kubernetes Deployment. Scaling = kubectl scale deploy/scoring-app --replicas=12. No JobManager, no Flink-specific CRDs.
State durability is via the changelog topic in Kafka — already running. No extra storage to operate.
Monitoring is the team's existing Prometheus + Grafana stack, with Kafka Streams metrics exposed via JMX → prometheus-jmx-exporter sidecar.
To pick Flink instead, the team would need: a JobManager HA setup, autoscaling TaskManagers (custom controller or Reactive Mode), savepoint S3 bucket with lifecycle, Flink web UI exposure, Flink-specific monitoring. A junior team cannot operate this without sustained investment.

Output.

Concern	Kafka Streams	Flink
Cluster components	0 (microservice only)	1 JobManager + N TaskManagers
HA setup	none (consumer-group rebalance)	ZooKeeper or K8s leader election
Savepoint storage	not applicable	S3 bucket + retention
Autoscaling	replica count of the Deployment	Reactive mode or custom controller
Monitoring	existing app metrics + JMX	Flink-specific dashboards

Rule of thumb. Q5 (ops capacity) is the question that most often forces Kafka Streams on small teams and most often forces Flink on large ones. Both choices are correct in their context.

Senior interview question on engine selection in 2026

A senior interviewer might frame this as: "You join a team that wants to build a real-time pricing engine. Walk me through how you'd pick between Kafka Streams, Flink, and Spark Structured Streaming. What are the trade-offs and what would push you to each?"

Solution Using the 5-question framework + engine-specific trade-offs

Engine selection — Kafka Streams vs Flink vs Spark SS
=====================================================

Q1 Source        → only Kafka? KS/Flink. Multi-source? Flink wins.
Q2 EOS surface   → Kafka-only sink? KS/Flink. Non-Kafka? Flink wins.
Q3 Parallelism   → source ≈ compute? KS/Flink. Op-parallel? Flink wins.
Q4 Latency       → < 50ms p99? KS wins. < 1s? KS/Flink. > 1s? add Spark SS.
Q5 Ops capacity  → no platform team? KS wins. Have one? Flink wins for scale.

Engine-specific trade-offs
--------------------------
Kafka Streams
  + No cluster, embedded library, simplest ops
  + EOS-v2 is mature, tight Kafka integration
  − Capped at partition count for parallelism
  − Kafka-only sinks for EOS
  − State capped by single-node disk

Flink
  + Operator-parallel scaling
  + EOS to any TwoPhaseCommit sink
  + Multi-source (Kafka + JDBC CDC + Kinesis + files)
  + Multi-TB state with remote snapshot
  − Requires platform team to operate
  − Higher operational overhead per job

Spark Structured Streaming
  + Unified batch + streaming code
  + Huge hiring pool (existing Spark batch teams)
  + Mature ML integration
  − Micro-batch by default (100ms–1s latency floor)
  − Continuous mode still less mature
  − Heavier per-record overhead than Flink

Step-by-step trace.

Pricing engine requirement	Answer	Pushes to
Q1 — Source	Kafka events + JDBC product catalog	Flink (multi-source)
Q2 — EOS surface	Kafka out + Postgres pricing snapshot	Flink (non-Kafka EOS sink)
Q3 — Parallelism	source 4 partitions, ML scoring 32 cores	Flink (per-operator)
Q4 — Latency	< 200 ms p99 for price update	Flink (sub-second OK)
Q5 — Ops	platform team exists, owns 8 other Flink jobs	Flink (cluster amortised)

Five Flink answers → Flink is the obvious pick. If Q1 had been "Kafka only" and Q5 "no platform team," the same framework would have picked Kafka Streams unambiguously.

Output:

Verdict	Reasoning
Pick Flink for the pricing engine	Multi-source + Postgres EOS sink + operator-parallel + sub-second + platform team
Don't pick Kafka Streams	Cannot handle Postgres EOS sink
Don't pick Spark SS	Latency floor too high for sub-second pricing

Why this works — concept by concept:

5 questions in order — answering Q1 first eliminates options early. If Q1 forces Flink, you don't waste time on Q2–Q5 in a Kafka Streams direction.
Engine-specific trade-offs — each engine has 2-3 unique strengths and 2-3 hard constraints. Memorising them lets you fall back to "what would push you to X" answers when an interviewer probes deeper.
Latency vs ops — Q4 and Q5 frequently conflict. Sub-second + no platform team is a hard combination; you may be forced to argue for hiring a platform engineer.
Cluster amortisation — the cost of running a Flink cluster is amortised over the number of jobs. With 8+ jobs, the marginal cost of job 9 is near zero. With 1 job, the cluster is overhead.
Cost — engine choice is a 1-year-or-more commitment; the migration cost between engines is 3-6 months of senior-engineer time. Get Q1–Q5 right the first time.

Streaming
Topic — streaming
Streaming engine trade-off problems

Practice →

Streaming
Topic — streaming · medium
Medium streaming design problems

Practice →

Cheat sheet — streaming engine recipes

When KStream is enough. Kafka in, Kafka out, EOS-v2 to Kafka, partition-bounded parallelism, microservice deployment. The 80% case for Kafka-only pipelines.
When you need Flink. Multi-source connectors (Kinesis, JDBC CDC, S3 files, Pulsar), EOS to non-Kafka sinks (Iceberg, JDBC, S3), operator-parallel scaling, multi-TB state, sub-second latency on multi-million-events/sec firehoses.
When Spark SS wins. Existing Spark batch team, unified batch+streaming code, 100ms–1s latency budget, ML integration with MLlib already in production.
Kafka Streams EOS recipe. processing.guarantee=exactly_once_v2 + replication.factor>=3 + min.in.sync.replicas>=2 on internal topics + Kafka 2.5+ brokers. Single transactional producer per stream thread.
Flink EOS recipe. env.enableCheckpointing(interval) + state backend = RocksDB + sink implements TwoPhaseCommitSinkFunction + setTransactionalIdPrefix on KafkaSink. Asynchronous incremental snapshots for large state.
State backend swap (Flink). HashMapStateBackend for ≤ 1 GB state (lowest latency, JVM-bounded); EmbeddedRocksDBStateBackend for > 1 GB state (off-heap, disk-spillable, multi-TB capable). Switch via state.backend: rocksdb in flink-conf.yaml.
Co-partitioning rule (Kafka Streams). Same number of partitions + same partitioner + same key serde on both sides of a KStream-KStream or KStream-KTable join. If any fails, insert repartition() explicitly or design the topics with matching partition counts.
Operator parallelism rule (Flink). Set .setParallelism(N) per operator; chain compatible operators (same parallelism + forward partitioning) into a single task to eliminate inter-operator serialisation. keyBy, rebalance, broadcast break chains.
Standby replicas (Kafka Streams). num.standby.replicas=1 keeps warm RocksDB copies on other instances; reduces rebalance recovery from minutes to seconds. Worth the doubled memory cost on any production app.
Savepoints (Flink). Take a savepoint before any topology change with flink savepoint <jobId> <dir>; tag every stateful operator with .uid("op-id"); restore with flink run -s <savepoint> -allowNonRestoredState.
Backpressure debug (Flink). Look for HIGH backpressure in the web UI on the operator just upstream of the slow one; either scale up that operator's parallelism or look for skewed keys.
Interactive Queries (Kafka Streams). Expose RocksDB stores via KafkaStreams#store(...) + a thin REST layer. Use StreamsMetadata to route queries to the instance owning the key's partition.
GlobalKTable rule. Use only when the lookup table is < ~1 GB per instance. Above that, switch to a regular KTable with co-partitioning or to Flink's async lookup join.

Frequently asked questions

Is Kafka Streams a competitor to Apache Flink?

Not exactly — they target different deployment shapes. Kafka Streams is a JVM library you embed inside your microservice (no separate cluster), while Flink is a cluster runtime with a JobManager and TaskManagers. Kafka Streams shines on Kafka-only pipelines with partition-bounded parallelism and a microservice ops model; Flink shines on multi-source pipelines, per-operator parallelism, exactly-once to non-Kafka sinks (Iceberg, JDBC, S3), and multi-TB state. They overlap in the 20-30% middle where both engines work, and there the team picks based on what they already operate. Treating them as "the same kind of thing" is the most common kafka streams vs flink interview mistake.

What is the difference between KStream and KTable in Kafka Streams?

A KStream is an append-only record stream where every event matters — it's the unbounded event log view. A KTable is a changelog-backed materialised view where each (key, value) upserts the previous value for that key — it's the latest-value-per-key snapshot view. They are the same data viewed two ways (the stream-table duality): KTable.toStream() converts a table to its underlying changelog stream; KStream.groupByKey().aggregate(...) converts a stream to a table. The kafka streams ktable is internally backed by a local RocksDB state store with a changelog topic for durability. There's also a GlobalKTable, which replicates the whole table on every instance for star-schema joins against small lookup tables.

Does Kafka Streams support exactly-once streaming?

Yes — set processing.guarantee=exactly_once_v2 (exactly_once_v2 since Kafka Streams 2.5). The runtime uses a single transactional producer per stream thread (instead of per task in EOS-v1), dramatically reducing transaction overhead. Each commit atomically writes state-changelog records, output records, and advances consumer offsets — all-or-nothing across input + state + output. This gives end-to-end exactly-once streaming for Kafka-to-Kafka pipelines. The caveat: the EOS guarantee only covers Kafka sinks; writing to JDBC or Iceberg from Kafka Streams breaks the chain and forces you to either accept at-least-once + idempotent writes, or move the pipeline to Flink's TwoPhaseCommitSinkFunction.

When should I use Apache Flink over Kafka Streams?

Use Flink when any of the five push-to-Flink conditions hits: (1) multi-source pipelines (Kafka + Kinesis + JDBC CDC + S3 files); (2) exactly-once to non-Kafka sinks (Iceberg, Postgres, S3, ClickHouse); (3) per-operator parallelism that differs by 4x or more from source partition count (CPU-bound enrichment, ML scoring); (4) multi-TB keyed state with remote snapshot durability; (5) you already have a platform team operating one or more Flink clusters. Use Kafka Streams for the inverse: Kafka-only pipelines, EOS-v2 to Kafka, partition-bounded parallelism, microservice deployment, no platform team. The apache flink vs kafka question almost always reduces to "library or cluster?" — and that depends on your sources, sinks, and ops capacity.

How does Flink checkpointing differ from Kafka Streams changelog replication?

Both engines keep keyed state in RocksDB, but durability is achieved differently. Kafka Streams continuously replicates every state write to a co-located changelog topic in Kafka (one record per key, compacted by default). On task move, the new owner reads the changelog from offset 0 to rebuild RocksDB — fast for small state, slow for multi-GB state (mitigated by num.standby.replicas keeping warm copies). Flink uses flink checkpointing via Chandy-Lamport barriers — the JobManager periodically injects barriers into every source, each operator snapshots its state to remote storage (S3/HDFS) when the barrier arrives, and the checkpoint is complete when every operator acks. Snapshots are incremental and asynchronous by default; recovery loads the chain of incremental snapshots. The differences matter at scale: Kafka Streams ties durability to Kafka cluster capacity; Flink ties it to remote object store capacity (cheaper for multi-TB state).

Flink vs Spark Structured Streaming — when do I pick which?

Pick Flink for true streaming (record-at-a-time), sub-second latency requirements (10–500 ms), per-operator parallelism, and EOS to non-Kafka sinks via TwoPhaseCommit. Pick Spark Structured Streaming when you already have a Spark batch team, unified batch+streaming code is a real win (DataFrame API on both sides), 100 ms–1 s micro-batch latency is fine, and MLlib integration is already in production. The 2026 flink vs spark answer: Flink wins on latency and per-operator scaling; Spark wins on hiring pool and batch-streaming code unification. Continuous-mode Spark Structured Streaming has narrowed the latency gap, but it remains less mature than micro-batch and rarely the default deployment shape.

Practice on PipeCode

Drill the streaming practice library → for the KStream / KTable / window / state family of probes.
Rehearse on medium-difficulty streaming problems → when the interviewer wants engine-design depth.
Sharpen Python streaming drills → for the watermark + window + aggregation patterns.
Stack the real-time analytics library → for the end-to-end EOS + dashboarding probes.
Layer the aggregation library → for the keyed-state + windowed-aggregate family.
For the broader surface, read top data engineering interview questions →.
Stack the prerequisites with the only 5 skills you need to become a data engineer →.
Sharpen the Spark axis with the Apache Spark internals course →.
For ETL system design, work through the ETL system design course →.

Pipecode.ai is Leetcode for Data Engineering — every Kafka Streams and Flink recipe above ships with hands-on practice rooms where you wire the KStream-KTable join, the operator-parallel JobGraph, and the TwoPhaseCommit sink against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your `kafka streams vs flink` answer holds up under a senior interviewer's depth probes.

Practice streaming now →
Real-time analytics drills →

1. Why a streaming engine choice matters in 2026

Kafka Streams is a library, Flink is a cluster — the deployment shape, not the DSL, is what makes the engineering trade-off

Worked example — same word-count, two engines

Worked example — partition parallelism vs operator parallelism

Worked example — DSL ergonomics on a join

Senior interview question on engine choice

Solution Using a 4-question decision framework

2. Kafka Streams library model

kafka streams is a JVM library, not a server — one stream task per input partition is the whole parallelism story

Worked example — KStream vs KTable on a click counter

Worked example — co-partitioning a stream-stream join

Worked example — GlobalKTable for a small dimension join

Senior interview question on Kafka Streams parallelism

Solution Using partition-count expansion + standby replicas

3. Flink JobGraph + operator chain

flink jobgraph is what your DSL becomes — operators chained, parallelism per operator, slots shared, snapshots cut by barriers

Worked example — same DSL, different parallelism per operator

Worked example — checkpoint barriers in flight

Worked example — savepoint upgrade with topology change

Senior interview question on JobGraph compilation

Solution Using the StreamGraph → JobGraph → ExecutionGraph pipeline

4. State backends + exactly-once

flink checkpointing snapshots state to remote storage; Kafka Streams replicates state to a changelog topic — different mechanisms, same exactly-once guarantee

Worked example — exactly-once in Kafka Streams

Worked example — TwoPhaseCommitSinkFunction in Flink

Worked example — RocksDB tuning for large state

Senior interview question on end-to-end exactly-once

Solution Using barrier-aligned snapshot + TwoPhaseCommit sink

5. Picking the engine — co-located vs cluster

Senior engineers pick kafka streams vs flink from a 5-question decision tree — never from "which is faster"

Worked example — Q1 source check forces the choice

Worked example — Q2 EOS surface forces Flink for Iceberg sink

Worked example — Q5 ops capacity forces Kafka Streams

Senior interview question on engine selection in 2026

Solution Using the 5-question framework + engine-specific trade-offs

Cheat sheet — streaming engine recipes

Frequently asked questions

Is Kafka Streams a competitor to Apache Flink?

What is the difference between KStream and KTable in Kafka Streams?

Does Kafka Streams support exactly-once streaming?

When should I use Apache Flink over Kafka Streams?

How does Flink checkpointing differ from Kafka Streams changelog replication?

Flink vs Spark Structured Streaming — when do I pick which?

Practice on PipeCode

`kafka streams` is a JVM library, not a server — one stream task per input partition is the whole parallelism story

`flink jobgraph` is what your DSL becomes — operators chained, parallelism per operator, slots shared, snapshots cut by barriers

`flink checkpointing` snapshots state to remote storage; Kafka Streams replicates state to a changelog topic — different mechanisms, same exactly-once guarantee

Senior engineers pick `kafka streams vs flink` from a 5-question decision tree — never from "which is faster"