kafka streams vs flink is the single biggest architecture call a streaming team makes in 2026 — and the one most teams get wrong on the first try because they treat both engines as "streaming frameworks" instead of as fundamentally different deployment shapes. Kafka Streams is a JVM library you embed inside your microservice. Flink is a cluster runtime with a JobManager, a fleet of TaskManagers, and its own scheduler. The DSL on top looks similar; the operational reality could not be more different.
This guide is the senior-DE comparison you wished existed the first time an interviewer asked "when do you pick apache flink vs kafka for stateful processing?" or "how does flink vs spark change once you need exactly-once joins on 100K events per second?" It walks through the Kafka Streams library model (KStream / KTable / GlobalKTable, per-partition stream tasks, embedded RocksDB stores backed by changelog topics), the Flink JobGraph + operator-chain runtime (StreamGraph → JobGraph → ExecutionGraph, TaskManagers, slot sharing, savepoints), state backends and exactly-once on both sides (flink checkpointing via Chandy-Lamport barriers vs kafka streams changelog replay, two-phase-commit sinks for end-to-end EOS), and the 5-question decision tree senior engineers use to pick the right engine for a new pipeline. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the streaming practice library →, rehearse on streaming medium-difficulty problems →, and sharpen the Python axis with the Python streaming drills →.
On this page
- Why a streaming engine choice matters in 2026
- Kafka Streams library model
- Flink JobGraph + operator chain
- State backends + exactly-once
- Picking the engine — co-located vs cluster
- Cheat sheet — streaming engine recipes
- Frequently asked questions
- Practice on PipeCode
1. Why a streaming engine choice matters in 2026
Kafka Streams is a library, Flink is a cluster — the deployment shape, not the DSL, is what makes the engineering trade-off
The one-sentence invariant: Kafka Streams runs as a JVM library inside your process; Flink runs as a separate cluster of JobManager + TaskManagers that your job is deployed to. Every other comparison — DSL ergonomics, state model, exactly-once semantics, scale envelope, hiring pool — follows from that one structural difference. Once you internalise "library vs cluster," the entire kafka streams vs flink interview surface collapses to a sequence of consequences from that one choice.
Three axes interviewers actually probe.
- State. Both engines keep keyed state in RocksDB. Kafka Streams replicates it continuously to a changelog topic in Kafka. Flink snapshots it periodically to a remote DFS (S3 / HDFS / GCS) via Chandy-Lamport barriers. The recovery story is different (replay-log vs restore-snapshot), and that difference dominates failover latency.
-
Exactly-once. Kafka Streams ships
processing.guarantee=exactly_once_v2which leans on Kafka's transactional producer. Flink shipsTwoPhaseCommitSinkFunctionwhich generalises the pattern to any sink that can pre-commit and commit. End-to-end EOS on both engines requires a transactional sink; getting that wrong is the most common production bug interviewers love to probe. - Scale. Kafka Streams parallelism is bounded by your input partition count — N partitions → N stream tasks, period. Flink parallelism is per-operator — a Source can run at parallelism 4, a KeyBy at 16, a Sink at 8, all in the same job. If your pipeline needs to scale operator-by-operator (the source is fast, the windowing is slow), Flink wins. If you can afford to keep partitions ≈ parallelism, Kafka Streams is dramatically simpler.
The DSL split — same name, different mental model.
-
Kafka Streams ships a Java/Scala DSL with
KStream,KTable,GlobalKTable. The unit of work is a stream task pinned to an input partition. You think in terms of "this partition has its own RocksDB store; co-partitioning makes joins work." - Flink ships a Java/Scala/Python DataStream API and a SQL layer (Flink SQL) on top. The unit of work is an operator subtask that runs inside a TaskManager slot. You think in terms of "this operator runs at parallelism N; the keyed state is partitioned by hash; a barrier cuts a global snapshot every 60s."
The 2026 reality — what changed since 2022.
-
Kafka Streams 3.x added
exactly_once_v2(single transactional producer per stream thread instead of per task), dramatically reducing the producer-fencing overhead that made EOS-v1 expensive at scale. -
Flink 1.18 / 2.0 stabilised the disaggregated state backend (
ForStDB) and made async snapshotting the default — checkpoints no longer block the operator's hot path. -
Spark Structured Streaming got real continuous-mode (sub-100ms) but still ships micro-batch as the default; the
flink vs sparkinterview answer in 2026 is "Flink for sub-second latency, Spark when you already have a Spark batch team and 100ms-1s is fine." - Apache Pulsar and Redpanda are now common Kafka-protocol replacements; Kafka Streams works on Redpanda out of the box, Flink works on Pulsar via the official connector.
What interviewers listen for.
- Do you say "Kafka Streams is a library, Flink is a cluster" in the first sentence? — senior signal.
- Do you describe state as "RocksDB local + changelog (Kafka Streams) or RocksDB local + distributed snapshot (Flink)"? — required answer.
- Do you mention "end-to-end exactly-once needs a transactional sink in both" unprompted? — senior signal.
- Do you push back on "Flink replaces Kafka Streams" with "they target different deployment shapes"? — senior signal.
Worked example — same word-count, two engines
Detailed explanation. The classic streaming Hello World — count words per minute from a topic of strings — looks deceptively similar in both engines. The DSL reads almost identically. The runtime is wildly different: Kafka Streams launches inside your JAR with no scheduler; Flink compiles the topology into a JobGraph and submits it to a cluster. Both store the per-key count in RocksDB. Only Flink lets you scale the windowing operator independently of the source parallelism.
Question. Write a 1-minute tumbling word-count over a words Kafka topic in both Kafka Streams (Java DSL) and Flink (DataStream API). Highlight the runtime difference (where does the job actually run?), the parallelism model, and the state location.
Input.
| event_time | word |
|---|---|
| 12:00:05 | kafka |
| 12:00:10 | flink |
| 12:00:30 | kafka |
| 12:01:05 | flink |
| 12:01:40 | flink |
Code.
// Kafka Streams — runs inside YOUR JVM as a library
StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream("words")
.groupBy((k, v) -> v)
.windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
.count(Materialized.as("word-count-store"))
.toStream()
.to("word-counts", Produced.with(WindowedSerdes.timeWindowedSerdeFrom(String.class), Serdes.Long()));
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start(); // your microservice now hosts the topology
// Flink DataStream API — submitted to a Flink cluster as a JobGraph
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000);
DataStream<String> words = env.fromSource(
KafkaSource.<String>builder().setTopics("words").build(),
WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)),
"kafka-source");
words.keyBy(w -> w)
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.aggregate(new CountAggregator())
.sinkTo(KafkaSink.<Tuple2<String, Long>>builder()
.setTransactionalIdPrefix("wc-")
.build());
env.execute("word-count"); // submits to JobManager — cluster runs it
Step-by-step explanation.
- The DSL looks similar — both call
keyBy / groupBythenwindowthen aggregate. The difference is the call site:streams.start()in Kafka Streams launches inside the local JVM;env.execute()in Flink hands the JobGraph to a remote JobManager. - Kafka Streams creates one stream task per input partition — if
wordshas 6 partitions, exactly 6 tasks run, distributed across whatever instances of your microservice are alive. State for each task lives in RocksDB on the same machine as the task. - Flink compiles
keyBy → window → aggregateinto an operator chain, then deploys it at the configured parallelism (independent of partition count). The keyed state is partitioned by the key's hash mod parallelism; RocksDB instances live in TaskManager slots, snapshotted to S3 on every checkpoint. - Both engines emit the windowed counts at window-close (or earlier with
EMIT EARLYconfig in Kafka Streams suppress operator / Flinkincremental aggregation). Output goes back to Kafka in both cases. - The key operational difference: Kafka Streams scales by adding microservice replicas; Flink scales by adding TaskManagers and reconfiguring the JobGraph parallelism.
Output.
| window_start | window_end | word | count |
|---|---|---|---|
| 12:00:00 | 12:01:00 | kafka | 2 |
| 12:00:00 | 12:01:00 | flink | 1 |
| 12:01:00 | 12:02:00 | flink | 2 |
Rule of thumb. If the team already runs Spring Boot / Quarkus microservices, Kafka Streams piggybacks on that infra and there is no new cluster to operate. If the team needs operator-parallel scaling, multi-source connectors, or sub-second latency on a 1M events/sec firehose, the Flink cluster pays for itself.
Worked example — partition parallelism vs operator parallelism
Detailed explanation. A common interview question — "you have a Kafka topic with 4 partitions but the per-key compute is CPU-bound at 10x the source rate. How does each engine scale?" The answer is the cleanest demonstration of the library-vs-cluster split.
Question. Given a Kafka topic events with 4 partitions and a CPU-bound aggregation step, show how Kafka Streams and Flink scale the aggregation independently of the source.
Input.
| topic | partitions | source rate | per-key compute rate |
|---|---|---|---|
| events | 4 | 40K msg/sec | 4K msg/sec |
Code.
// Kafka Streams — parallelism CAPPED at partition count (4)
// To scale, you MUST increase the partition count on the source topic.
builder.stream("events")
.groupByKey()
.aggregate(...) // runs at partition parallelism = 4
.toStream()
.to("agg-output");
// Workaround: repartition first
builder.stream("events")
.selectKey((k, v) -> newKey(v))
.repartition(Repartitioned.numberOfPartitions(16)) // 16 internal partitions
.groupByKey()
.aggregate(...) // now runs at parallelism 16
.to("agg-output");
// Flink — operator parallelism is INDEPENDENT of source partitions
env.fromSource(kafkaSource, watermark, "src")
.setParallelism(4) // source = 4 (one per partition)
.keyBy(e -> e.userId)
.process(new HeavyAggregator())
.setParallelism(16) // aggregator = 16
.sinkTo(kafkaSink)
.setParallelism(8); // sink = 8
Step-by-step explanation.
- In Kafka Streams, the aggregation runs in the same stream task as the source — same partition, same JVM thread. Parallelism is fixed at partition count = 4. You cannot run 16 aggregator threads against 4 partitions; one task owns one partition end-to-end.
- The workaround is to add a
repartitionstep which writes to an internal Kafka topic with 16 partitions, then reads it back. That decouples source parallelism from downstream parallelism but adds a round-trip to Kafka (latency + storage cost). - In Flink, every operator has its own
parallelismsetting. The source runs at 4 (one subtask per partition), the heavy aggregator runs at 16, the sink runs at 8. Flink inserts a shuffle automatically — keyed network exchange — between operators with different parallelism. - The Flink scheduler maps the 16 aggregator subtasks across however many TaskManagers are available, using slot sharing to pack multiple operators into the same slot when possible.
- End result: Flink can run the heavy step at 16 threads while Kafka Streams either runs at 4 or pays for an internal repartition topic.
Output (architecture comparison).
| Aspect | Kafka Streams | Flink |
|---|---|---|
| Source parallelism | = partition count (4) | configurable (4) |
| Aggregator parallelism | = upstream parallelism unless you repartition | configurable (16) |
| Repartition cost | extra Kafka topic + round-trip | in-cluster shuffle, no Kafka |
| Scaling unit | partition count + microservice replicas | TaskManager slots |
Rule of thumb. If your per-operator compute cost varies by 4x or more, Flink's operator parallelism pays for the cluster cost. If your topology is mostly source-paced, Kafka Streams stays simpler.
Worked example — DSL ergonomics on a join
Detailed explanation. Stream-table joins are the bread and butter of streaming pipelines: enrich an event stream with a slowly-changing dimension table. Both engines support it; the syntax and the semantics differ in important ways.
Question. Write a stream-table join that enriches every orders event with the latest customers profile. Show the Kafka Streams KStream-KTable join and the Flink keyed connect + state equivalent.
Input — orders (KStream / DataStream).
| order_id | customer_id | amount |
|---|---|---|
| 1 | c1 | 50 |
| 2 | c2 | 30 |
| 3 | c1 | 75 |
Input — customers (KTable / KeyedState).
| customer_id | tier |
|---|---|
| c1 | gold |
| c2 | silver |
Code.
// Kafka Streams — natural KStream-KTable join
KTable<String, Customer> customers = builder.table("customers");
KStream<String, Order> orders = builder.stream("orders");
orders.join(customers,
(order, customer) -> enrich(order, customer))
.to("enriched-orders");
// Flink — connect two streams, keep customer in keyed broadcast state
DataStream<Order> orders = env.fromSource(orderSource, ws, "orders");
BroadcastStream<Customer> customers = env.fromSource(customerSource, ws, "customers")
.broadcast(customerDescriptor);
orders.keyBy(o -> o.customerId)
.connect(customers)
.process(new KeyedBroadcastProcessFunction<>() {
public void processElement(Order o, ReadOnlyContext ctx, Collector<Enriched> out) {
Customer c = ctx.getBroadcastState(customerDescriptor).get(o.customerId);
out.collect(enrich(o, c));
}
public void processBroadcastElement(Customer c, Context ctx, Collector<Enriched> out) {
ctx.getBroadcastState(customerDescriptor).put(c.customerId, c);
}
})
.sinkTo(enrichedSink);
Step-by-step explanation.
- Kafka Streams treats the
customerstopic as aKTable— a changelog representation where each new record upserts the value for its key. The runtime materialises the table into RocksDB on the same instance. - The
joinoperator co-partitions the streams: theorderstopic and thecustomerstopic must be co-partitioned (same key, same partition count, same partitioner). The DSL emits an internal repartition topic if co-partitioning fails. - Flink has no first-class KStream-KTable join; you build it with
connect + KeyedBroadcastProcessFunction. The broadcast stream replicates every customer record to every subtask, and each subtask keeps its own copy in broadcast state. - For small lookup tables, broadcast state is fine. For large lookup tables, use Flink's
LookupJoinagainst an external store (JDBC, HBase, async I/O) — that is the Flink idiom that has no Kafka Streams equivalent. - The output is the same: enriched orders keyed by
customer_id. The operational shape differs: Kafka Streams co-locates the lookup state; Flink lets you choose between in-memory broadcast and external lookup.
Output.
| order_id | customer_id | amount | tier |
|---|---|---|---|
| 1 | c1 | 50 | gold |
| 2 | c2 | 30 | silver |
| 3 | c1 | 75 | gold |
Rule of thumb. If your lookup is a Kafka topic with bounded cardinality, Kafka Streams KTable is the lower-friction answer. If your lookup is in a database or has > 10M keys, Flink's async lookup join is the only realistic option.
Senior interview question on engine choice
A senior interviewer often opens with: "Walk me through how you would choose between Kafka Streams and Flink for a new pipeline. What are the three or four questions you ask, in order, and what answer to each one pushes you to one engine over the other?"
Solution Using a 4-question decision framework
Decision framework — Kafka Streams vs Flink
1. Where does the data come from?
- Kafka only → Kafka Streams eligible
- Kafka + Kinesis/JDBC/files → Flink (multi-source connectors)
2. What is the parallelism shape?
- source rate ≈ compute rate, partition count is enough → Kafka Streams
- per-operator parallelism differs by 4x or more → Flink
3. What is the exactly-once contract?
- Kafka → Kafka, transactional → Kafka Streams EOS-v2 is enough
- Kafka → JDBC / S3 / Iceberg, transactional → Flink TwoPhaseCommitSinkFunction
4. What is the operational shape?
- Already running Spring Boot microservices, no platform team → Kafka Streams
- Have or want a streaming platform team running a cluster → Flink
Step-by-step trace.
| Pipeline | Q1 source | Q2 parallelism | Q3 EOS | Q4 ops | Picked |
|---|---|---|---|---|---|
| User clickstream → enrich → Kafka | Kafka | source ≈ compute | Kafka → Kafka | microservices | Kafka Streams |
| Multi-source CDC → Iceberg | Kafka + JDBC | varies | EOS to Iceberg | platform team | Flink |
| Real-time fraud (heavy ML) | Kafka | compute >> source | Kafka → Kafka | platform team | Flink |
| Per-user state machine, low scale | Kafka | source ≈ compute | Kafka → Kafka | microservices | Kafka Streams |
After the 4-question pass, the engine choice is usually unambiguous. The remaining 5% — where both engines work — defaults to whatever the team already operates.
Output:
| Engine | When it wins |
|---|---|
| Kafka Streams | Kafka-only, partition-bounded parallelism, microservice deployment, EOS-v2 to Kafka |
| Flink | Multi-source, per-operator parallelism, EOS to non-Kafka sinks, cluster-runtime team |
Why this works — concept by concept:
- Library vs cluster — the structural axis — every other consequence (state, parallelism, EOS, ops) follows from where the engine runs. Asking the deployment question first short-circuits a lot of false dichotomies.
- Partition vs operator parallelism — Kafka Streams is partition-bounded by design (one task per partition). Flink is operator-parallel (each operator has its own parallelism). If those two coincide for your workload, Kafka Streams is simpler.
-
EOS surface area — Kafka Streams EOS-v2 is bound to Kafka producers. Flink's EOS generalises via two-phase commit to any sink that implements
TwoPhaseCommitSinkFunction(JDBC, S3, Iceberg, Pulsar). - Ops cost is real — running a Flink cluster requires a platform team (JobManager HA, TaskManager autoscaling, savepoint storage, monitoring). Kafka Streams piggybacks on whatever runs your microservices.
- Cost — both engines are O(events) per record; the cost lives in operations, not throughput. Cluster TCO often dominates the cloud bill once you cross ~5 jobs.
Streaming
Topic — streaming
Streaming engine selection problems
2. Kafka Streams library model
kafka streams is a JVM library, not a server — one stream task per input partition is the whole parallelism story
The mental model in one line: a Kafka Streams app is a JAR that you run as a microservice; the runtime creates exactly one stream task per input partition, each task owns its slice of RocksDB state, and durability is provided by a per-task changelog topic written back to Kafka. Once you say that out loud, every kafka streams ktable interview question becomes a deduction from "tasks are partition-pinned and state is local."
The three core abstractions.
-
KStream — an append-only record stream. Every event is independent; "rate of events" is the only thing that matters. Operations:
filter,map,flatMap,groupBy,join,repartition. -
KTable — a changelog-backed materialised view. Each
(key, value)upserts the previous value for that key. Operations:mapValues,groupBy + aggregate,join. Internally backed by a RocksDB store + changelog topic. - GlobalKTable — a KTable replicated on every instance of the app. Used for star-schema joins against small lookup tables. No co-partitioning required (joins don't need matching partitions on the stream side).
The stream-table duality.
- A table is a snapshot of the latest value per key.
- A stream is the sequence of changes that produced the table.
- They are the same data, viewed two ways.
toStream()converts a KTable into the underlying changelog stream;groupByKey().aggregate()converts a KStream into a KTable.
Stream tasks — the unit of parallelism.
- One task per input partition. A topic with 12 partitions → 12 stream tasks. Period.
-
Tasks are scheduled across
StreamThreads within a single JVM, and across multiple instances of the app. You scale horizontally by adding instances; Kafka Streams' consumer-group protocol rebalances tasks across the alive instances. - A task owns its slice of RocksDB. No cross-task state access. Co-partitioning rules force every input topic the task reads to share the same partition.
The changelog topic — durability without a separate store.
- Every write to a state store is also written to an internal changelog topic with name
appId-storeName-changelog. Compacted by design (one record per key). - On task restart on a different machine, the new owner replays the changelog from offset 0 to rebuild RocksDB.
- For large stores, this can take minutes — Kafka Streams ships
standby replicasconfig to keep warm copies on other instances, slashing recovery time.
Co-partitioning rules (the rule book interviewers love).
-
Same number of partitions on both input topics for a
KStream-KStreamorKStream-KTablejoin. - Same partitioner (default is murmur2 hash; if one side was written with a custom partitioner, joins fail).
- Same key serde — the join uses the key bytes for equality.
- If any rule fails, Kafka Streams either errors at build time or transparently inserts a
repartitionstep (extra internal topic + round-trip to Kafka).
Interactive Queries — reading state out of the running app.
- Kafka Streams exposes
KafkaStreams#store(...)to query the local RocksDB stores from the running app's JVM. - A small HTTP layer (you write it; Kafka Streams gives you
StreamsMetadatato route queries) turns the app into a queryable cache. - This is the "no separate KV store" pattern many teams use to avoid running a separate Redis / DynamoDB for read-side state.
Common interview probes on the library model.
- "What is the relationship between input partitions and stream tasks?" — strictly 1:1.
- "What happens to state when a task moves to another instance?" — replay the changelog; warm with standby replicas.
- "When does Kafka Streams insert an internal repartition topic?" — when co-partitioning fails or when you call
selectKey/groupBywith a different key. - "What is a
GlobalKTablefor?" — small lookup tables that you join from any partition without co-partitioning.
Worked example — KStream vs KTable on a click counter
Detailed explanation. A clicks topic emits (user_id, page_id) records. You want two views: the raw stream of clicks (for downstream pipelines) and the running click count per user (for a dashboard). KStream is the stream view; KTable is the aggregated view.
Question. Build both views on the same clicks topic. Show the DSL, the underlying state store, and how the two outputs differ.
Input.
| time | user_id | page_id |
|---|---|---|
| 12:00 | u1 | /home |
| 12:01 | u2 | /home |
| 12:02 | u1 | /pricing |
| 12:03 | u1 | /home |
Code.
StreamsBuilder builder = new StreamsBuilder();
// KStream view — every click forwarded as-is
KStream<String, String> clicks = builder.stream("clicks");
clicks.to("clicks-out");
// KTable view — running click count per user
KTable<String, Long> clickCount = clicks
.groupByKey()
.count(Materialized.as("click-count-store"));
clickCount.toStream().to("click-count-out", Produced.with(Serdes.String(), Serdes.Long()));
Step-by-step explanation.
-
clicks.to("clicks-out")is a pure stream pass-through — every input record produces one output record. -
groupByKey().count()builds aKTablematerialised as theclick-count-storeRocksDB store. Each input record updates the store and emits an update record to the changelog and the output topic. - After
u1 /homeat 12:00, the store hasu1 → 1and emits(u1, 1)to the count topic. - After
u1 /pricingat 12:02, the store updates tou1 → 2and emits(u1, 2). The previous(u1, 1)record is not deleted — it is superseded by the upsert in a compacted topic. - After
u1 /homeat 12:03, store updates tou1 → 3. Final state per user is the latest value per key in the changelog.
Output (KStream view — clicks-out).
| user_id | page_id |
|---|---|
| u1 | /home |
| u2 | /home |
| u1 | /pricing |
| u1 | /home |
Output (KTable view — click-count-out, compacted to latest).
| user_id | count |
|---|---|
| u1 | 3 |
| u2 | 1 |
Rule of thumb. If consumers want every event, model as KStream. If consumers want the latest value per key, model as KTable and emit to a compacted output topic. The compacted topic is a free read-side cache.
Worked example — co-partitioning a stream-stream join
Detailed explanation. A common bug: an interviewer asks you to join orders and payments by order_id, both written from different microservices. The orders topic has 12 partitions; the payments topic has 8. The naive orders.join(payments, ...) fails to compile (or worse, runs but produces wrong results in some versions) because co-partitioning is broken.
Question. Write a KStream-KStream join between orders and payments keyed on order_id, given that the two topics have different partition counts. Show the fix.
Input — orders (12 partitions).
| order_id | amount |
|---|---|
| o1 | 50 |
| o2 | 30 |
Input — payments (8 partitions).
| order_id | paid |
|---|---|
| o1 | 50 |
| o2 | 30 |
Code.
KStream<String, Order> orders = builder.stream("orders");
KStream<String, Payment> payments = builder.stream("payments");
// BROKEN — co-partitioning fails (12 vs 8 partitions)
// orders.join(payments, ..., JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)))
// .to("matched");
// Fix 1 — explicit repartition to align partition counts
KStream<String, Payment> paymentsRepart = payments
.repartition(Repartitioned.<String, Payment>as("payments-repart")
.withNumberOfPartitions(12));
orders.join(paymentsRepart,
(o, p) -> new Match(o.id, o.amount, p.paid),
JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofMinutes(5)))
.to("matched");
Step-by-step explanation.
- Kafka Streams checks co-partitioning at build time. If both sides come from topics with mismatched partition counts (or different partitioners), it throws a
TopologyExceptioninstructing you torepartition. -
payments.repartition(...)writes the stream to an internal Kafka topic with 12 partitions, keyed by the originalorder_id. The outputKStreamof that operator reads from the new internal topic. - Now both sides are co-partitioned: the same
order_idlands in the same partition number on both streams, so the same stream task can join the two locally without network shuffles. - The join uses a
JoinWindow— orders and payments within 5 minutes of each other match. The state is kept in a windowed RocksDB store (one per task) backed by a changelog. - The cost: one extra Kafka round-trip per payment record. If the topics will live forever, the cleaner fix is to make the producer write to 12 partitions in the first place so no internal repartition is ever needed.
Output.
| order_id | amount | paid |
|---|---|---|
| o1 | 50 | 50 |
| o2 | 30 | 30 |
Rule of thumb. Whenever you design a topic that will participate in joins, pick a partition count that matches every counterparty topic. The internal repartition topic is a real latency and storage cost; eliminate it at the schema design stage.
Worked example — GlobalKTable for a small dimension join
Detailed explanation. A clicks stream needs enrichment with a small geo lookup (~200 country codes). Co-partitioning a 200-row table to match the 100-partition clicks topic is wasteful. GlobalKTable replicates the whole table on every instance, so any partition can do the lookup locally.
Question. Enrich a clicks KStream with country_name from a small country_codes topic using a GlobalKTable. Show why this avoids the co-partitioning constraint.
Input — clicks (100 partitions).
| user_id | country_code |
|---|---|
| u1 | US |
| u2 | DE |
Input — country_codes (1 partition).
| country_code | country_name |
|---|---|
| US | United States |
| DE | Germany |
Code.
GlobalKTable<String, String> countries = builder.globalTable("country_codes");
KStream<String, Click> clicks = builder.stream("clicks");
clicks.join(countries,
(clickKey, click) -> click.countryCode, // key extractor on the stream side
(click, countryName) -> click.withCountry(countryName))
.to("clicks-enriched");
Step-by-step explanation.
-
builder.globalTable("country_codes")materialises the whole topic into a local RocksDB store on every instance of the app. All 200 country codes live on every node. - The
joinsignature takes a key extractor (because the stream key isuser_idbut the join key iscountry_code). NoselectKeyis needed; no internal repartition topic is created. - Every stream task — regardless of its partition assignment — can look up any country code from its local copy of the global table. The lookup is a single RocksDB get.
- The trade-off: every instance pays the memory cost of the full table. Worth it for 200 rows; catastrophic for 200M rows.
Output.
| user_id | country_name |
|---|---|
| u1 | United States |
| u2 | Germany |
Rule of thumb. Use GlobalKTable only when the lookup table fits in memory on every node (rule of thumb: < ~1 GB compacted). For larger lookups, use a regular KTable with co-partitioning, or hop out to Flink + async lookup join.
Senior interview question on Kafka Streams parallelism
A senior interviewer might ask: "Your Kafka Streams app processes 200K msg/sec from a 4-partition topic. You've added 10 instances of the app and CPU is still pegged on 4 of them. What is happening and how do you fix it?"
Solution Using partition-count expansion + standby replicas
// Diagnosis: 4 partitions = max 4 stream tasks = max 4 CPU cores doing work
// The other 6 instances are idle (or running standby replicas).
// Fix 1 — increase partition count on the source topic to 24
// (one-time, via kafka-topics --alter --partitions 24)
// Fix 2 — increase num.stream.threads per instance to use more cores per JVM
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 6);
// Fix 3 — keep 1 standby replica per task to shorten recovery on rebalance
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
Step-by-step trace.
| Step | Before fix | After fix |
|---|---|---|
| Topic partitions | 4 | 24 |
| Max stream tasks | 4 | 24 |
| Active instances doing work | 4 / 10 | up to 24 / 10 |
| Threads per instance | 1 | 6 |
| Standby replicas per task | 0 | 1 |
| Recovery on rebalance | minutes (replay full changelog) | seconds (warm RocksDB) |
After the fix, the 10 instances run 24 tasks distributed roughly 2-3 per instance. With num.stream.threads=6 and 2-3 tasks per instance, all 6 threads have work; CPU is balanced.
Output:
| Metric | Before | After |
|---|---|---|
| Throughput | 200K msg/sec (capped) | 1.2M msg/sec (linear scale) |
| Idle instances | 6 | 0 |
| Recovery time | ~3 min | ~10 sec |
| Internal topic count | 0 | 0 (no repartition added) |
Why this works — concept by concept:
- Partition count is the throughput ceiling — Kafka Streams creates exactly one stream task per partition. You cannot run more tasks than partitions, no matter how many instances you add. Increasing partitions is the only way to grow the work pool.
-
num.stream.threads = cores per JVM — once the topic has enough partitions, each instance can run multiple tasks in parallel. Setting
num.stream.threads≈ CPU cores maxes out the JVM. -
Standby replicas warm RocksDB — without standbys, a rebalance forces the new task owner to replay the whole changelog from offset 0. With
num.standby.replicas=1, one other instance keeps a hot copy; rebalances take seconds. - No internal repartition needed — the fix changes partition count once on the source topic, not in the topology, so no internal repartition topic is created. Lower latency and storage cost.
- Cost — increasing partitions is permanent and irreversible; pick a count that covers 3-5 years of growth. Per-record CPU cost is unchanged; recovery cost drops from O(changelog size) to O(delta since last standby tick).
Streaming
Topic — streaming
Kafka Streams parallelism problems
3. Flink JobGraph + operator chain
flink jobgraph is what your DSL becomes — operators chained, parallelism per operator, slots shared, snapshots cut by barriers
The mental model in one line: a Flink program is a StreamGraph (your DSL), compiled to a JobGraph (chained operators), then deployed as an ExecutionGraph (parallel subtasks) onto TaskManager slots managed by a JobManager. Once you say "the JobGraph chains operators and each operator has its own parallelism," the entire apache flink vs kafka operator-parallelism interview surface becomes a sequence of consequences from that compilation pipeline.
The compilation pipeline.
-
StreamGraph — the raw user DSL: every operator is its own node, every edge is a network connection. Built client-side from
env.fromSource(...)... .keyBy(...)... .sinkTo(...). - JobGraph — the optimised graph: adjacent operators with the same parallelism and forward partitioning are chained into a single task. Chaining eliminates the serialisation between operators.
- ExecutionGraph — the parallel deployment: each operator's parallelism setting expands into N subtasks; the JobManager assigns subtasks to TaskManager slots.
The cluster topology.
- JobManager — one master process per job (HA mode: standby JobManagers via ZooKeeper / K8s). Owns the checkpoint coordinator, the scheduler, and the job's metadata.
- TaskManager — worker processes, one or more. Each TaskManager runs a fixed number of slots (configurable, typically = CPU cores). Each slot can host multiple operator subtasks via slot sharing.
- Slot sharing groups — by default, every operator in a job belongs to the same slot-sharing group, which lets multiple operators co-locate in the same slot (less network shuffle, more efficient).
Parallelism — per-operator, not per-job.
- Every operator has its own
setParallelism(N). Without an explicit setting, Flink uses the job's default parallelism. - A
keyByreshuffles records by hash so that all records with the same key go to the same downstream subtask. Keyed state is partitioned the same way. - Mismatched parallelism between adjacent operators triggers a rebalance / hash / rescale partitioning between them.
Operator state vs keyed state.
- Operator state (broadcast, list, union) — per-subtask state independent of key. Useful for connector offsets, broadcast lookups.
-
Keyed state (ValueState, ListState, MapState, ReducingState) — per-key state inside a
KeyedStream. The runtime guarantees that all records with the same key go to the same subtask, so keyed state access is local. -
State backend —
HashMapStateBackend(heap, JVM-bounded) orEmbeddedRocksDBStateBackend(off-heap, disk-bounded). Snapshotted on checkpoint.
Checkpoints vs savepoints — the operational split.
-
Checkpoints — automatic, periodic, fast restore on failure. Internal format; tied to a job execution. Configured via
env.enableCheckpointing(interval). - Savepoints — user-triggered, portable, versioned snapshots used for planned restarts, upgrades, or topology migrations. Survive cross-Flink-version restores.
- Both are produced by the same Chandy-Lamport barrier mechanism, but savepoints get extra metadata and are intentionally kept user-owned.
Backpressure — credit-based flow control.
- Each consumer subtask grants the upstream producer a credit for every free buffer slot. The producer cannot send beyond granted credits.
- When a downstream operator is slow, its credit pool dries up; the upstream operator stops sending and slows down its own input. The pressure propagates back to the source.
- Flink's web UI exposes backpressure per operator — "HIGH" backpressure on an operator means that operator is the bottleneck.
Common interview probes on the JobGraph.
- "What is the difference between StreamGraph and JobGraph?" — StreamGraph is the raw DSL; JobGraph is the optimised chained graph.
- "What is operator chaining?" — adjacent operators with same parallelism and forward partitioning run in the same task without serialisation between them.
- "What is the difference between a checkpoint and a savepoint?" — checkpoints are automatic, internal, and tied to a running job; savepoints are user-triggered and portable across Flink versions.
- "What is slot sharing?" — multiple operator subtasks share a single TaskManager slot to reduce network shuffle and pack more parallelism per machine.
Worked example — same DSL, different parallelism per operator
Detailed explanation. A Flink job reads from a 4-partition Kafka topic, runs a heavy enrichment step, then writes to a 4-partition sink. The enrichment is CPU-bound; the team wants it at 16-way parallelism while keeping the source and sink at 4. Operator chaining and parallelism settings make this trivial in Flink.
Question. Write a Flink job with source parallelism 4, enrichment parallelism 16, and sink parallelism 4. Show the JobGraph nodes and which subtasks live in which TaskManager slot.
Input.
| Operator | Parallelism | Notes |
|---|---|---|
| Kafka source | 4 | one subtask per partition |
| Map (filter) | 4 | chained with source |
| Heavy enrichment | 16 | CPU-bound, needs scale |
| Kafka sink | 4 | matches output partition count |
Code.
DataStreamSource<Event> src = env.fromSource(
KafkaSource.<Event>builder().setTopics("events").build(),
WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)),
"events-src")
.setParallelism(4);
DataStream<Event> filtered = src
.filter(e -> e.kind != null)
.setParallelism(4); // chains with source
DataStream<Enriched> enriched = filtered
.keyBy(e -> e.userId) // hash partition by userId
.process(new HeavyEnrichmentFn())
.setParallelism(16); // CPU-bound, scale out
enriched.sinkTo(KafkaSink.<Enriched>builder()
.setTransactionalIdPrefix("enriched-")
.build())
.setParallelism(4); // matches sink partition count
Step-by-step explanation.
-
env.fromSource(...).setParallelism(4)creates 4 source subtasks, one per Kafka partition. Each subtask reads from its assigned partition and emits records downstream. -
.filter(...).setParallelism(4)runs at the same parallelism as the source with forward partitioning, so Flink chains the source and filter into a single task: no serialisation between them, both run in the same TaskManager slot. -
.keyBy(...)is a hash partitioning operation. It breaks the chain — the next operator has parallelism 16 (different from upstream 4), so Flink inserts a network shuffle that hash-partitions records byuserIdacross 16 downstream subtasks. -
.process(new HeavyEnrichmentFn()).setParallelism(16)runs the CPU-bound function at 16-way parallelism. Each subtask handles ~1/16 of the keys. -
.sinkTo(...).setParallelism(4)runs the sink at 4. Another shuffle from 16 → 4 (rebalance, round-robin by default). - Slot sharing means the 4 source-filter subtasks, the 16 enrichment subtasks, and the 4 sink subtasks can co-locate in slots if there is room. The JobManager packs them efficiently across the available TaskManager slots.
Output (JobGraph shape).
| Node | Parallelism | Chained with | Shuffle in |
|---|---|---|---|
| Source + Filter | 4 | (chained) | none |
| Heavy Enrichment | 16 | (own task) | hash by userId |
| Sink | 4 | (own task) | rebalance |
Rule of thumb. Use setParallelism per operator on any DAG where compute cost varies by step. Use the same parallelism for chained sequential operators (source → filter → map) to keep them in one task with zero serialisation overhead.
Worked example — checkpoint barriers in flight
Detailed explanation. Flink's checkpoint mechanism is the Chandy-Lamport algorithm adapted to streaming. The JobManager periodically injects a barrier into every source subtask. Each operator forwards the barrier to its outputs only after persisting its own state for that barrier. When the sinks see the barrier, the checkpoint is complete.
Question. Trace a single checkpoint through a 3-operator pipeline (Source → KeyBy + Aggregate → Sink) at parallelism 2. Show what each subtask does at the moment a barrier passes through it.
Input.
| Time | Event |
|---|---|
| t=0 | barrier-7 injected by JobManager to both source subtasks |
Code.
env.enableCheckpointing(60_000); // every 60s
env.getCheckpointConfig().setCheckpointStorage("s3://flink/cp");
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5_000);
env.getCheckpointConfig().setCheckpointTimeout(120_000);
env.getCheckpointConfig().enableUnalignedCheckpoints();
// Setting it once on the env applies to every operator.
Step-by-step explanation.
- JobManager's CheckpointCoordinator generates checkpoint id 7 and sends a "trigger" RPC to every source subtask.
- Each source subtask records its read offset (e.g. Kafka offset 12345 for partition 0, 12678 for partition 1), persists the offset to the state backend, then emits a checkpoint barrier (a special record containing checkpoint id 7) into the output stream.
- The KeyBy + Aggregate operator receives the barrier from one of its input channels. It buffers further records on that channel until the barrier arrives on all input channels (this is "barrier alignment" in aligned mode).
- Once aligned, the operator atomically snapshots its keyed state (the per-key running aggregates) to the state backend, then forwards the barrier downstream.
- With unaligned checkpoints, step 3 changes: the operator does not wait for barrier alignment; it captures in-flight records on the slower channels into a side buffer and snapshots immediately. Trades extra state size for lower checkpoint latency under backpressure.
- The sink receives the barrier, completes any pending transactional commit, and acknowledges to the JobManager. Once all sinks acknowledge, checkpoint 7 is complete and is the new restore point.
- If any subtask fails before completion, the JobManager restarts the job from checkpoint 6 (or whichever was the last completed one).
Output (per-subtask state at barrier-7 completion).
| Subtask | Persisted state |
|---|---|
| Source-0 | Kafka offset 12345 |
| Source-1 | Kafka offset 12678 |
| Aggregate-0 | RocksDB snapshot for keys hash mod 2 == 0 |
| Aggregate-1 | RocksDB snapshot for keys hash mod 2 == 1 |
| Sink-0 | last-committed Kafka producer transaction id |
| Sink-1 | last-committed Kafka producer transaction id |
Rule of thumb. Aligned checkpoints are correct but slow under backpressure. Enable unaligned checkpoints on any job with bursty load or skewed keys — the extra state size is small compared to the checkpoint-timeout you would otherwise hit.
Worked example — savepoint upgrade with topology change
Detailed explanation. A live Flink job needs a new operator added between source and the existing aggregate. You cannot just redeploy — the new job's operator IDs differ, so checkpoint restore fails. The pattern is: take a savepoint, stop the old job, add uid() tags to every stateful operator, redeploy from the savepoint with the new topology.
Question. A Flink job runs in production. The team wants to insert a new filter operator before the aggregate. Show the savepoint + restart sequence and the uid() tagging discipline that makes it work.
Input.
| Old DAG | New DAG |
|---|---|
| Source → Aggregate → Sink | Source → Filter → Aggregate → Sink |
Code.
# 1) Take a savepoint of the running job
flink savepoint <jobId> s3://flink/savepoints/my-job
// 2) Redeploy with a NEW operator + explicit uid() tags
DataStreamSource<Event> src = env.fromSource(kafkaSource, ws, "src")
.uid("src-1") // STABLE id across restarts
.setParallelism(4);
DataStream<Event> filtered = src
.filter(e -> e.kind != null)
.uid("filter-1") // NEW operator — fresh id
.setParallelism(4);
DataStream<Agg> aggregated = filtered
.keyBy(e -> e.userId)
.process(new AggregateFn())
.uid("agg-1") // SAME id as before
.setParallelism(16);
aggregated.sinkTo(kafkaSink)
.uid("sink-1") // SAME id as before
.setParallelism(4);
# 3) Start from the savepoint
flink run -s s3://flink/savepoints/my-job/savepoint-... -allowNonRestoredState my-job.jar
Step-by-step explanation.
- The savepoint captures every operator's state by operator uid. Without an explicit
uid(), Flink generates one from the operator's position in the DAG — any topology change rotates the uid and breaks restore. - Tagging every stateful operator with a stable
uid()makes the topology change safe: the runtime maps savepoint state back to the matching uid even if the DAG order changed. - The new
filter-1operator has a fresh uid; it has no prior state to restore. Flink's-allowNonRestoredStateflag prevents the restore from erroring on the new operator. - The Aggregate operator's uid is unchanged → its RocksDB state from the old job is restored into the same operator in the new job.
- Once running, the new job resumes processing from the savepoint's offsets with the new filter step inserted.
Output (state mapping during restore).
| Savepoint operator | New job operator | Mapped? |
|---|---|---|
| src-1 | src-1 | yes |
| — | filter-1 (new) | no (fresh state) |
| agg-1 | agg-1 | yes |
| sink-1 | sink-1 | yes |
Rule of thumb. Tag every stateful operator with an explicit .uid(...) from day one. The first time you need a topology upgrade, you will be glad you did. Without it, you take a downtime for state reset.
Senior interview question on JobGraph compilation
A senior interviewer might ask: "Explain what happens to a Flink program between env.execute() and the moment a record is actually processed on a TaskManager. What graphs get built, what optimisations happen, and what is the role of each component?"
Solution Using the StreamGraph → JobGraph → ExecutionGraph pipeline
Client side
-----------
1. User builds DataStream API → StreamGraph
2. StreamGraph optimised (chain compatible operators) → JobGraph
3. JobGraph + JAR submitted via REST to JobManager
JobManager side
---------------
4. JobManager receives JobGraph
5. Each JobVertex expanded by parallelism → ExecutionGraph (parallel subtasks)
6. Scheduler requests slots from ResourceManager (one slot per subtask, modulo slot sharing)
7. Subtasks deployed to TaskManager slots; checkpoint coordinator initialised
TaskManager side
----------------
8. TaskManagers receive the TaskDeploymentDescriptor
9. Each slot loads the user code, sets up the state backend, opens network channels
10. Source subtasks begin reading; records flow through the operator chain
Step-by-step trace.
| Stage | Where | Output |
|---|---|---|
| 1 | Client JVM | StreamGraph (user-DSL DAG) |
| 2 | Client JVM | JobGraph (chained operators) |
| 3 | Client → REST | JAR + JobGraph submitted |
| 4 | JobManager | JobGraph parsed |
| 5 | JobManager | ExecutionGraph (subtasks per operator) |
| 6 | JobManager + RM | Slot offers granted |
| 7 | JobManager | Subtasks deployed |
| 8 | TaskManager | TDD received |
| 9 | TaskManager | Operator chains initialised |
| 10 | TaskManager | Records flow |
The pipeline distinguishes three graphs because each stage has its own purpose: StreamGraph is the user model, JobGraph is the optimised deployment unit, ExecutionGraph is the runtime parallel form.
Output:
| Component | Owns | Talks to |
|---|---|---|
| Client | StreamGraph → JobGraph | JobManager (REST) |
| JobManager | JobGraph + ExecutionGraph + CheckpointCoordinator | TaskManagers (RPC) |
| ResourceManager | Slot pool | JobManager + TaskManagers |
| TaskManager | Slots + user code + state backend | JobManager (heartbeat + checkpoints) |
Why this works — concept by concept:
- Three graphs, three concerns — StreamGraph is what you wrote, JobGraph is what gets deployed, ExecutionGraph is what actually runs. Each stage is a refinement; you do not jump straight from DSL to subtasks.
-
Operator chaining at JobGraph — chain-compatible operators (same parallelism, forward partitioning) fuse into a single task. Eliminates serialisation between them. Caveat: chaining is broken by
keyBy,rebalance,broadcast. - Slot sharing at ExecutionGraph — by default, every operator in a job shares slots. Lets the scheduler pack source + map + sink subtasks for the same parallelism slice into the same slot, reducing total slot demand.
- Checkpoint coordinator lives in JobManager — barriers originate from the JM and acknowledgements return to the JM. The JM is the single point of consistency for the job.
- Cost — compilation and submission are O(operators); runtime is O(events). The bookkeeping is constant per checkpoint, so checkpoint overhead is bounded.
Streaming
Topic — streaming
Flink JobGraph problems
4. State backends + exactly-once
flink checkpointing snapshots state to remote storage; Kafka Streams replicates state to a changelog topic — different mechanisms, same exactly-once guarantee
The mental model in one line: both engines keep keyed state in RocksDB; Kafka Streams replicates every state change to a Kafka changelog topic for durability, while Flink snapshots state periodically to remote storage via Chandy-Lamport barriers — and both engines achieve end-to-end exactly-once by pairing this state durability with a transactional sink. Once you say "durability mechanism differs, EOS semantics align," the exactly-once streaming interview surface becomes a deduction from those two halves.
Kafka Streams state durability — changelog replay.
- Every write to a state store goes to a co-located changelog topic in Kafka with name
<application.id>-<store-name>-changelog. - The changelog topic is compacted by default — exactly one record per key, so it converges to the size of the state store, not the lifetime stream.
- On task move (rebalance or restart), the new owner reads the changelog topic from offset 0 to rebuild the RocksDB store.
-
Standby replicas keep warm copies on other instances. With
num.standby.replicas=1, recovery is O(delta since last warm-up tick) instead of O(full changelog).
Flink state durability — distributed snapshot.
- The state backend is either
HashMapStateBackend(heap-only, JVM-bounded) orEmbeddedRocksDBStateBackend(off-heap + disk, multi-TB capable). - On every checkpoint, the state is snapshotted to a configured
CheckpointStorage(file URI:s3://,hdfs://,gs://). - The snapshot is incremental in RocksDB mode — only the SST files that changed since the last checkpoint are uploaded. Restoration loads the chain of incremental snapshots.
- The snapshot is asynchronous by default — the operator hot path keeps processing while the state is copied to remote storage in the background.
The EOS mode ladder.
- At-least-once — the default for both engines if you don't enable EOS. Records are processed and committed; on failure, the source replays from the last committed offset, so some records can be reprocessed.
- Idempotent — duplicates are produced but the sink dedupes (e.g. unique constraint on output, upsert by primary key). End-state-exactly-once even if events are not.
- Transactional / EOS-v2 — sources rewind to the last committed checkpoint, operators replay deterministically, and the sink commits atomically with the checkpoint. No duplicates ever leave the job.
Kafka Streams EOS-v2.
- Set
processing.guarantee=exactly_once_v2inStreamsConfig. - The runtime uses a single transactional producer per stream thread (instead of per task in EOS-v1) — dramatically lower transaction overhead.
- Each
committicks a Kafka transaction across input offsets + state changelog writes + output topic writes — all-or-nothing. - Requires Kafka 2.5+ on the broker side and
min.in.sync.replicas≥ 2 for the transactional__transaction_statetopic.
Flink TwoPhaseCommitSinkFunction.
- An abstract class with three methods:
beginTransaction(),preCommit(txn),commit(txn). - On every checkpoint barrier, the sink calls
preCommit— buffer flushed but not yet visible. - On checkpoint completion, the JobManager notifies all sinks; each sink calls
committo publish the buffered batch atomically. - Concrete implementations:
KafkaSink(Kafka transactional producer),FileSink(temp file → atomic rename),JdbcSink(transaction), Iceberg sink (commit a new snapshot).
The three EOS failure modes interviewers ask about.
- Source replay without idempotence on the sink. Sources rewind on failure; the sink sees duplicates. Caught by: "what is your sink's idempotence story?"
-
State not in the snapshot. A new operator (or a missing
uid()) means state was not restored; counts double on restart. Caught by: "do you tag every stateful operator withuid()?" - Sink commit not atomic with the checkpoint. Sink writes commit before the source acks → duplicates. Caught by: "does your sink implement TwoPhaseCommit?"
Worked example — exactly-once in Kafka Streams
Detailed explanation. A Kafka Streams app reads from orders, computes a running per-user total, writes to totals. The team wants exactly-once: no duplicate totals, no missed orders, even across crashes.
Question. Configure a Kafka Streams app for EOS-v2 and explain what the runtime now does differently. Show the config and the producer-side transaction shape.
Input.
| Topic | Role |
|---|---|
| orders | source |
| totals | sink |
| -running-totals-changelog | internal state durability |
Code.
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "totals-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, "exactly_once_v2");
props.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 3); // for internal topics
StreamsBuilder builder = new StreamsBuilder();
builder.<String, Order>stream("orders")
.groupBy((k, v) -> v.userId)
.aggregate(() -> 0L,
(k, v, agg) -> agg + v.amount,
Materialized.<String, Long>as("running-totals"))
.toStream()
.to("totals", Produced.with(Serdes.String(), Serdes.Long()));
KafkaStreams app = new KafkaStreams(builder.build(), props);
app.start();
Step-by-step explanation.
- With
exactly_once_v2, every stream thread holds a single transactional producer. The transactional id is derived deterministically fromapplication.id+ stream-thread id. - On each
commit.interval.mstick (default 100ms with EOS), the runtime opens a transaction. Within the transaction it: reads input offsets, writes state-changelog records, writes output records, advances the consumer's committed offsets. - On commit, all four writes are atomic. Either all are visible or none — no duplicates, no lost updates.
- On crash, the new owner of the task fences the old transactional producer (Kafka's transactional fencing protocol) and starts a fresh transaction from the last committed offset. The state store rebuilds from the changelog up to that offset.
- The price of EOS-v2 is higher latency at commit boundaries (transactions add a round-trip) and the requirement that the Kafka brokers support transactions (
isr=2+,min.in.sync.replicas=2).
Output (per-transaction atomic write set).
| Write | Visible if commit succeeds |
|---|---|
| Updated state-changelog records | yes |
Output records to totals
|
yes |
Advanced consumer offsets on orders
|
yes |
| All-or-nothing on failure | yes |
Rule of thumb. Use EOS-v2 by default on any Kafka-only pipeline. The latency overhead (~10ms at commit) is negligible compared to the engineering cost of a custom dedup pipeline.
Worked example — TwoPhaseCommitSinkFunction in Flink
Detailed explanation. A Flink job reads from Kafka, computes hourly aggregates, and writes them to Iceberg. The team wants end-to-end exactly-once. Flink's KafkaSource is rewind-safe on its own; the Iceberg sink must implement TwoPhaseCommit so that a checkpoint and the Iceberg commit happen atomically.
Question. Show how Flink's TwoPhaseCommit sink integrates with checkpointing. Walk through the four phases for a single checkpoint.
Input.
| Stage | Action |
|---|---|
| Checkpoint barrier injected | t=0 |
| Sink preCommit | t=1 |
| Checkpoint complete | t=2 |
| Sink commit (notifyCheckpointComplete) | t=3 |
Code.
public class IcebergTwoPhaseSink extends TwoPhaseCommitSinkFunction<Row, TxnHandle, Void> {
@Override
protected TxnHandle beginTransaction() {
// open a new Iceberg writer; returns a handle
return new TxnHandle(icebergTable.newAppend());
}
@Override
protected void invoke(TxnHandle txn, Row row, Context ctx) {
txn.append(row); // buffered to local files
}
@Override
protected void preCommit(TxnHandle txn) {
txn.flushFilesToObjectStore(); // visible to txn only
}
@Override
protected void commit(TxnHandle txn) {
txn.appendOp.commit(); // atomic Iceberg metadata flip
}
@Override
protected void abort(TxnHandle txn) {
txn.deleteUncommittedFiles();
}
}
Step-by-step explanation.
- The sink begins a transaction the moment a checkpoint barrier triggers (after the previous commit). All records between barriers go into this transaction.
- On the barrier,
preCommitis called: data files are flushed to object storage but not yet visible to the table (no metadata update). The transaction is durable but pending. - The checkpoint metadata (transaction id, snapshot pointers) is included in the checkpoint state and persisted to remote storage.
- Once the JobManager confirms checkpoint completion, every sink instance receives
notifyCheckpointComplete→ callscommit→ flips the Iceberg metadata pointer atomically. The new data is now visible to all readers. - If the job crashes between
preCommitandcommit, recovery re-reads the last completed checkpoint and finds the pending transaction. It either retries the commit (idempotent in Iceberg) or aborts and re-runs from the prior checkpoint.
Output (Iceberg snapshot visibility timeline).
| Time | Sink state | Iceberg snapshot |
|---|---|---|
| t=0 | begin txn N | snapshot N-1 |
| t=0..1 | records buffered | snapshot N-1 |
| t=1 | preCommit (files flushed) | snapshot N-1 |
| t=2 | checkpoint complete | snapshot N-1 |
| t=3 | commit (metadata flip) | snapshot N |
Rule of thumb. Implement TwoPhaseCommit on any sink that supports atomic publish (transactional Kafka, Iceberg, JDBC, S3 with rename). For sinks that do not, fall back to idempotent writes with a dedup column.
Worked example — RocksDB tuning for large state
Detailed explanation. Both engines use RocksDB under the hood. For multi-TB state, RocksDB tuning becomes the dominant operational concern: block cache size, write buffer, compaction triggers. The defaults work for low-GB state; large state needs hand-tuning.
Question. A Flink job has a 500GB keyed state. Default RocksDB settings cause high latency on writes and slow recovery. Show the key config knobs and how each one affects performance.
Input.
| State size | Default RocksDB issue |
|---|---|
| 500 GB / TaskManager | small block cache → cache miss on every read |
Code.
# flink-conf.yaml — large-state RocksDB tuning
state.backend: rocksdb
state.backend.incremental: true
state.backend.rocksdb.localdir: /mnt/nvme/rocksdb # fast SSD
# Memory budget per slot (off-heap)
state.backend.rocksdb.memory.managed: true
state.backend.rocksdb.memory.fixed-per-slot: 4 GB
# Write buffer + L0 trigger
state.backend.rocksdb.writebuffer.size: 256 MB
state.backend.rocksdb.writebuffer.count: 4
# Block cache + bloom filters
state.backend.rocksdb.block.cache-size: 2 GB
state.backend.rocksdb.block.bloomfilter: true
# Compaction
state.backend.rocksdb.compaction.level.use-dynamic-size: true
state.backend.rocksdb.thread.num: 4
Step-by-step explanation.
-
state.backend.rocksdb.localdiron a fast NVMe drive avoids the EBS/EFS latency penalty. RocksDB is I/O-heavy; SSD is non-negotiable. -
memory.managed=truelets Flink budget RocksDB memory off-heap, preventing OOM on the JVM heap.fixed-per-slotpartitions it predictably. -
writebuffer.size + writebuffer.countcontrols L0 flush frequency. Larger buffers reduce write amplification at the cost of higher memory. -
block.cache-sizecaches the most-read SST blocks. With a 2 GB cache hitting 80%+ of reads, latency drops 10x compared to the default 8 MB cache. -
bloomfilterskips disk reads for keys that definitely do not exist in an SST. Especially useful for point-lookup keyed state. -
compaction.level.use-dynamic-size + thread.numlets compaction scale with write volume; without it, L0 backs up under bursty writes and stalls foreground writes.
Output.
| Tuning | Default | Tuned | Effect |
|---|---|---|---|
| block.cache-size | 8 MB | 2 GB | 10x read latency drop |
| writebuffer.size | 64 MB | 256 MB | 4x lower write amplification |
| localdir | EBS gp2 | NVMe | 5x I/O latency drop |
| memory.managed | heap | off-heap | no OOM on heavy state |
Rule of thumb. Provision NVMe local disks for any Flink job with >100 GB state. Use the managed-memory off-heap config; never let RocksDB share JVM heap with operator code. Recover times scale linearly with state size — keep checkpoints incremental and small.
Senior interview question on end-to-end exactly-once
A senior interviewer might ask: "Walk me through, end-to-end, how exactly-once works in Flink. What does the source do, what does the operator do, what does the sink do, and what is the JobManager's role? Where can it break?"
Solution Using barrier-aligned snapshot + TwoPhaseCommit sink
End-to-end exactly-once in Flink
================================
Source side
-----------
1. Kafka source records its read offsets as part of operator state.
2. On checkpoint, the offset becomes part of the snapshot.
3. On restore, source rewinds to the snapshot's offset.
Operator side
-------------
4. Stateful operator's keyed state lives in RocksDB.
5. On barrier, operator buffers any extra input until barrier alignment
(or with unaligned checkpoints, captures in-flight records into the snapshot).
6. Operator snapshots its state to remote storage (S3/HDFS) asynchronously.
Sink side (TwoPhaseCommit)
--------------------------
7. On barrier, sink's preCommit() flushes buffered output to durable but
not-yet-visible storage (e.g. Kafka transactional producer pre-commit,
Iceberg pending snapshot).
8. JobManager waits for every operator + sink to ack the barrier.
9. JobManager declares the checkpoint complete; notifies every sink.
10. Each sink's commit() makes the buffered output visible atomically.
Failure recovery
----------------
11. On crash, JobManager restarts from last COMPLETED checkpoint.
12. Source rewinds, operators restore state, sinks abort any pending txn
that did not complete, and processing resumes deterministically.
Step-by-step trace.
| Phase | Source | Operator | Sink |
|---|---|---|---|
| pre-checkpoint | reading offset 1000 | processing records | buffering txn-N |
| barrier injected | snapshot offset 1000 | aligned | preCommit txn-N |
| barrier propagated | — | snapshot state | preCommit done |
| checkpoint complete | acked | acked | commit txn-N |
| visible to readers | — | — | yes |
| crash & restore | rewind to 1000 | restore state | abort txn-N+1 |
The trace shows that all three components (source, operator, sink) must participate in the barrier protocol for end-to-end EOS. If any one falls back to at-least-once (e.g. sink without TwoPhaseCommit), the whole job degrades to at-least-once.
Output:
| Component | EOS contribution |
|---|---|
| Source | offset snapshotted with checkpoint; rewinds on restore |
| Operator | state snapshotted asynchronously, barrier-aligned consistency |
| Sink | preCommit + commit tied to checkpoint completion |
| JobManager | barrier coordinator, declares checkpoint complete, notifies sinks |
Why this works — concept by concept:
- Barrier-aligned snapshot — the Chandy-Lamport algorithm produces a globally consistent snapshot of distributed state without stopping the world. Each operator snapshots independently once it sees barriers on all input channels.
- Source rewind on restore — Flink does not need exactly-once delivery from the source. It needs replayable sources. Kafka offsets, file paths, JDBC cursors are all replayable; the EOS guarantee is built on top.
- TwoPhaseCommit on the sink — splits "buffer and persist" from "make visible." The visibility flip is tied to the checkpoint completion, so source rewind + sink rollback are atomic across the pipeline.
- JobManager as coordinator — the JM is the single decision point. Without a centralised coordinator, distributed two-phase commit has no leader; with the JM, it is a textbook 2PC.
- Cost — checkpoint overhead is O(state) for the snapshot upload (mitigated by incremental + async). Sink overhead is O(records) for the buffering. EOS-v2 latency overhead is ~10-50 ms at commit boundaries — small for batch sinks (Iceberg), more noticeable for low-latency sinks (Kafka).
Streaming
Topic — streaming
Exactly-once streaming problems
5. Picking the engine — co-located vs cluster
Senior engineers pick kafka streams vs flink from a 5-question decision tree — never from "which is faster"
The mental model in one line: the engine choice is a deployment-shape decision (library vs cluster), driven by 5 questions about source, parallelism, exactly-once surface, latency, and ops capacity — not by raw throughput numbers. Once you internalise the decision tree, the flink vs spark and apache flink vs kafka interview surface collapses to "what does your pipeline actually need?"
The 5 questions in order.
- Q1 — Source. Kafka only? → Kafka Streams eligible. Multi-source (Kafka + Kinesis + JDBC CDC + S3 files)? → Flink (much richer connector library).
- Q2 — Exactly-once surface. Kafka in, Kafka out, transactional? → Kafka Streams EOS-v2 is enough. EOS across non-Kafka sinks (Iceberg, JDBC, S3 files)? → Flink TwoPhaseCommitSinkFunction.
- Q3 — Parallelism shape. Source rate ≈ compute rate, partition count is enough? → Kafka Streams. Per-operator parallelism differs by 4x or more? → Flink.
- Q4 — Latency budget. Sub-50ms p99 with no network hop? → Kafka Streams embedded. Sub-second is fine? → either, with Flink also viable. 1s+? → Spark Structured Streaming becomes a contender.
- Q5 — Ops capacity. No platform team, just microservices? → Kafka Streams. Platform team that can operate a Flink JobManager + autoscaling TaskManagers? → Flink.
Scale envelope cheat sheet.
- Kafka Streams — sweet spot. ≤ 64 partitions, ≤ 1M events/sec, state ≤ ~50 GB per node, EOS-v2 to Kafka. Above that, you hit partition-count walls, changelog throughput limits, or single-node state caps.
- Flink — sweet spot. 100s–1000s of operator subtasks, multi-TB state (RocksDB + remote snapshot), multi-source, EOS to non-Kafka sinks. The cluster overhead is amortised once you cross ~5 jobs.
- Spark Structured Streaming — when it wins. Already-running Spark batch team, micro-batch latency (100ms–1s) is fine, unified batch + streaming code is a win. Loses on sub-second latency.
Ops cost — the hidden line item.
- Kafka Streams ops. No cluster. State backups via Kafka changelog. Monitoring via JMX from your microservice. Roughly zero net-new operational surface beyond what your microservices already need.
- Flink ops. JobManager HA (ZooKeeper or K8s leader election). TaskManager autoscaling (reactive mode or external operator). Savepoint storage (S3 lifecycle, retention). Web UI + REST API exposure. Checkpoint storage cost (incremental snapshots × N jobs × retention).
- Spark ops. YARN / Kubernetes cluster, driver HA, RDD checkpoint storage, streaming-vs-batch resource contention if you share a cluster.
Hiring pool — the other hidden line item.
- Kafka Streams developers are Java/Kotlin/Scala microservice engineers. Larger pool, lower cost.
- Flink developers are streaming-specialist; market is small and expensive.
- Spark developers are abundant on the batch side, sparse on the streaming side. Mixed-skill teams are common.
Common interview probes on engine selection.
- "When is Kafka Streams not an option?" — multi-source pipelines, EOS to non-Kafka sinks, per-operator parallelism > partition count, state > what fits on a single node.
- "When is Flink overkill?" — Kafka-only pipelines under 1M events/sec where the existing microservice deployment is fine.
- "Flink vs Spark Structured Streaming?" — Flink for sub-second + true streaming + per-operator parallelism; Spark when you already have Spark batch and 100ms–1s latency is fine.
- "Why not Pulsar Functions / Beam / others?" — usually because Kafka Streams and Flink have the deepest production track record and the largest community.
Worked example — Q1 source check forces the choice
Detailed explanation. A team building a real-time fraud system needs to combine a Kafka stream of transactions with a CDC stream from a Postgres users table arriving via Debezium and a file drop on S3 of merchant categories. Kafka Streams cannot connect to S3 files; Flink can.
Question. Given the source list, walk through Q1 and show why Flink is the only viable choice.
Input.
| Source | Protocol |
|---|---|
| transactions | Kafka topic |
| users CDC | Debezium → Kafka topic |
| merchant_categories | S3 files (daily drop) |
Code.
// Flink — three different source connectors, all in one job
DataStream<Txn> txns = env.fromSource(KafkaSource.builder()..., ws, "txns");
DataStream<User> users = env.fromSource(KafkaSource.builder()..., ws, "users-cdc");
DataStream<Category> cats = env.fromSource(
FileSource.forRecordStreamFormat(new CsvFormat(), new Path("s3://merchants/")).build(),
WatermarkStrategy.noWatermarks(),
"merchants");
txns.connect(users.broadcast(userDescriptor))
.process(new EnrichWithUser())
.keyBy(t -> t.merchantId)
.connect(cats.broadcast(catDescriptor))
.process(new EnrichWithCategory())
.sinkTo(...);
Step-by-step explanation.
- Kafka Streams supports exactly one source connector: Kafka. The S3 file drop is invisible to it. You would need a separate Kafka Connect pipeline to land S3 into Kafka first, adding a hop and latency.
- Flink's FileSource handles the S3 drop natively. The job is one deployment, three sources, no extra hops.
- The two broadcast streams (users + categories) replicate the small lookup data to every subtask; the transactions stream is keyed and processed per merchant.
- The same job would require a complete rewrite to land back on Kafka Streams; the source diversity is a hard constraint.
Output.
| Source | Kafka Streams native? | Flink native? |
|---|---|---|
| Kafka topic | yes | yes |
| Debezium CDC topic | yes (it's still Kafka) | yes |
| S3 files | no | yes |
| JDBC CDC direct | no | yes |
| Kinesis | no | yes |
| Pulsar | no | yes |
| RabbitMQ | no | yes |
Rule of thumb. Q1 alone often eliminates Kafka Streams. List every source the pipeline will touch in year 1; if any of them is not Kafka-protocol, Flink is the answer.
Worked example — Q2 EOS surface forces Flink for Iceberg sink
Detailed explanation. Your pipeline reads Kafka, computes aggregates, writes to an Iceberg table. The team needs exactly-once on the Iceberg side: no duplicate aggregate rows, no missed events even across crashes. Kafka Streams cannot do EOS to Iceberg; Flink can via the Iceberg connector's TwoPhaseCommit.
Question. Given a Kafka → aggregate → Iceberg pipeline, show why Q2 (EOS surface) pushes you to Flink.
Input.
| Pipeline | EOS requirement |
|---|---|
| Kafka → aggregate → Iceberg | end-to-end exactly-once |
Code.
// Flink — Iceberg sink with TwoPhaseCommit
DataStream<Agg> aggs = env.fromSource(kafkaSource, ws, "src")
.keyBy(e -> e.userId)
.window(TumblingEventTimeWindows.of(Time.hours(1)))
.aggregate(new HourlyAgg());
FlinkSink.forRowData(aggs.map(MyConverter::toRowData))
.table(icebergTable)
.tableSchema(schema)
.writeParallelism(8)
.upsert(false)
.append(); // TwoPhaseCommit under the hood
env.enableCheckpointing(60_000);
env.execute("hourly-aggs-to-iceberg");
Step-by-step explanation.
- The Iceberg Flink connector implements
TwoPhaseCommitSinkFunction. On every checkpoint barrier, it pre-commits data files to S3 (not yet visible to Iceberg readers). - When the checkpoint completes (JobManager ack), the sink commits the Iceberg metadata pointer atomically. All buffered rows become visible at once.
- On crash, recovery rewinds the Kafka source to the last completed checkpoint and re-runs. The pre-committed but not-finalised data files are detected and either retried or aborted.
- Kafka Streams cannot do this — its EOS-v2 contract is strictly Kafka-to-Kafka. To write to Iceberg, it would need a sidecar pipeline (Kafka → Iceberg via Kafka Connect), and the EOS chain would break at the connector.
Output.
| Sink | Kafka Streams EOS? | Flink EOS? |
|---|---|---|
| Kafka topic | yes (EOS-v2) | yes (KafkaSink) |
| Iceberg table | no | yes (FlinkSink) |
| JDBC | no | yes (JdbcSink TPC) |
| S3 files | no | yes (FileSink TPC) |
| Pulsar | no | yes |
| ClickHouse | no | yes (community connector) |
Rule of thumb. If your pipeline writes to anything other than Kafka and you need exactly-once, Flink is the answer. There is no second option.
Worked example — Q5 ops capacity forces Kafka Streams
Detailed explanation. A startup has 3 backend engineers, no platform team. They need to compute a per-user real-time score from a Kafka events topic and write back to a Kafka scores topic. Flink would work, but they cannot operate a JobManager + autoscaling TaskManagers + savepoint storage + monitoring on top of their existing microservice stack. Kafka Streams piggybacks on what they already run.
Question. Given the ops constraint, why does Kafka Streams win Q5? Show the deployment shape.
Input.
| Constraint | Value |
|---|---|
| Backend engineers | 3 |
| Platform team | 0 |
| Existing infra | Spring Boot + Kubernetes |
| Pipeline | Kafka events → score → Kafka scores |
Code.
# Just a regular Spring Boot deployment — adds zero net-new ops surface
apiVersion: apps/v1
kind: Deployment
metadata:
name: scoring-app
spec:
replicas: 6
template:
spec:
containers:
- name: scoring
image: registry/scoring-app:1.4.2
env:
- name: KAFKA_BOOTSTRAP
value: kafka:9092
- name: APPLICATION_ID
value: scoring-app
resources:
requests:
cpu: 1
memory: 2Gi
limits:
cpu: 2
memory: 4Gi
Step-by-step explanation.
- The Kafka Streams app is a regular Spring Boot jar deployed as a Kubernetes Deployment. Scaling =
kubectl scale deploy/scoring-app --replicas=12. No JobManager, no Flink-specific CRDs. - State durability is via the changelog topic in Kafka — already running. No extra storage to operate.
- Monitoring is the team's existing Prometheus + Grafana stack, with Kafka Streams metrics exposed via JMX → prometheus-jmx-exporter sidecar.
- To pick Flink instead, the team would need: a JobManager HA setup, autoscaling TaskManagers (custom controller or Reactive Mode), savepoint S3 bucket with lifecycle, Flink web UI exposure, Flink-specific monitoring. A junior team cannot operate this without sustained investment.
Output.
| Concern | Kafka Streams | Flink |
|---|---|---|
| Cluster components | 0 (microservice only) | 1 JobManager + N TaskManagers |
| HA setup | none (consumer-group rebalance) | ZooKeeper or K8s leader election |
| Savepoint storage | not applicable | S3 bucket + retention |
| Autoscaling | replica count of the Deployment | Reactive mode or custom controller |
| Monitoring | existing app metrics + JMX | Flink-specific dashboards |
Rule of thumb. Q5 (ops capacity) is the question that most often forces Kafka Streams on small teams and most often forces Flink on large ones. Both choices are correct in their context.
Senior interview question on engine selection in 2026
A senior interviewer might frame this as: "You join a team that wants to build a real-time pricing engine. Walk me through how you'd pick between Kafka Streams, Flink, and Spark Structured Streaming. What are the trade-offs and what would push you to each?"
Solution Using the 5-question framework + engine-specific trade-offs
Engine selection — Kafka Streams vs Flink vs Spark SS
=====================================================
Q1 Source → only Kafka? KS/Flink. Multi-source? Flink wins.
Q2 EOS surface → Kafka-only sink? KS/Flink. Non-Kafka? Flink wins.
Q3 Parallelism → source ≈ compute? KS/Flink. Op-parallel? Flink wins.
Q4 Latency → < 50ms p99? KS wins. < 1s? KS/Flink. > 1s? add Spark SS.
Q5 Ops capacity → no platform team? KS wins. Have one? Flink wins for scale.
Engine-specific trade-offs
--------------------------
Kafka Streams
+ No cluster, embedded library, simplest ops
+ EOS-v2 is mature, tight Kafka integration
− Capped at partition count for parallelism
− Kafka-only sinks for EOS
− State capped by single-node disk
Flink
+ Operator-parallel scaling
+ EOS to any TwoPhaseCommit sink
+ Multi-source (Kafka + JDBC CDC + Kinesis + files)
+ Multi-TB state with remote snapshot
− Requires platform team to operate
− Higher operational overhead per job
Spark Structured Streaming
+ Unified batch + streaming code
+ Huge hiring pool (existing Spark batch teams)
+ Mature ML integration
− Micro-batch by default (100ms–1s latency floor)
− Continuous mode still less mature
− Heavier per-record overhead than Flink
Step-by-step trace.
| Pricing engine requirement | Answer | Pushes to |
|---|---|---|
| Q1 — Source | Kafka events + JDBC product catalog | Flink (multi-source) |
| Q2 — EOS surface | Kafka out + Postgres pricing snapshot | Flink (non-Kafka EOS sink) |
| Q3 — Parallelism | source 4 partitions, ML scoring 32 cores | Flink (per-operator) |
| Q4 — Latency | < 200 ms p99 for price update | Flink (sub-second OK) |
| Q5 — Ops | platform team exists, owns 8 other Flink jobs | Flink (cluster amortised) |
Five Flink answers → Flink is the obvious pick. If Q1 had been "Kafka only" and Q5 "no platform team," the same framework would have picked Kafka Streams unambiguously.
Output:
| Verdict | Reasoning |
|---|---|
| Pick Flink for the pricing engine | Multi-source + Postgres EOS sink + operator-parallel + sub-second + platform team |
| Don't pick Kafka Streams | Cannot handle Postgres EOS sink |
| Don't pick Spark SS | Latency floor too high for sub-second pricing |
Why this works — concept by concept:
- 5 questions in order — answering Q1 first eliminates options early. If Q1 forces Flink, you don't waste time on Q2–Q5 in a Kafka Streams direction.
- Engine-specific trade-offs — each engine has 2-3 unique strengths and 2-3 hard constraints. Memorising them lets you fall back to "what would push you to X" answers when an interviewer probes deeper.
- Latency vs ops — Q4 and Q5 frequently conflict. Sub-second + no platform team is a hard combination; you may be forced to argue for hiring a platform engineer.
- Cluster amortisation — the cost of running a Flink cluster is amortised over the number of jobs. With 8+ jobs, the marginal cost of job 9 is near zero. With 1 job, the cluster is overhead.
- Cost — engine choice is a 1-year-or-more commitment; the migration cost between engines is 3-6 months of senior-engineer time. Get Q1–Q5 right the first time.
Streaming
Topic — streaming
Streaming engine trade-off problems
Streaming
Topic — streaming · medium
Medium streaming design problems
Cheat sheet — streaming engine recipes
- When KStream is enough. Kafka in, Kafka out, EOS-v2 to Kafka, partition-bounded parallelism, microservice deployment. The 80% case for Kafka-only pipelines.
- When you need Flink. Multi-source connectors (Kinesis, JDBC CDC, S3 files, Pulsar), EOS to non-Kafka sinks (Iceberg, JDBC, S3), operator-parallel scaling, multi-TB state, sub-second latency on multi-million-events/sec firehoses.
- When Spark SS wins. Existing Spark batch team, unified batch+streaming code, 100ms–1s latency budget, ML integration with MLlib already in production.
-
Kafka Streams EOS recipe.
processing.guarantee=exactly_once_v2+replication.factor>=3+min.in.sync.replicas>=2on internal topics + Kafka 2.5+ brokers. Single transactional producer per stream thread. -
Flink EOS recipe.
env.enableCheckpointing(interval)+ state backend = RocksDB + sink implementsTwoPhaseCommitSinkFunction+setTransactionalIdPrefixon KafkaSink. Asynchronous incremental snapshots for large state. -
State backend swap (Flink).
HashMapStateBackendfor ≤ 1 GB state (lowest latency, JVM-bounded);EmbeddedRocksDBStateBackendfor > 1 GB state (off-heap, disk-spillable, multi-TB capable). Switch viastate.backend: rocksdbinflink-conf.yaml. -
Co-partitioning rule (Kafka Streams). Same number of partitions + same partitioner + same key serde on both sides of a KStream-KStream or KStream-KTable join. If any fails, insert
repartition()explicitly or design the topics with matching partition counts. -
Operator parallelism rule (Flink). Set
.setParallelism(N)per operator; chain compatible operators (same parallelism + forward partitioning) into a single task to eliminate inter-operator serialisation.keyBy,rebalance,broadcastbreak chains. -
Standby replicas (Kafka Streams).
num.standby.replicas=1keeps warm RocksDB copies on other instances; reduces rebalance recovery from minutes to seconds. Worth the doubled memory cost on any production app. -
Savepoints (Flink). Take a savepoint before any topology change with
flink savepoint <jobId> <dir>; tag every stateful operator with.uid("op-id"); restore withflink run -s <savepoint> -allowNonRestoredState. - Backpressure debug (Flink). Look for HIGH backpressure in the web UI on the operator just upstream of the slow one; either scale up that operator's parallelism or look for skewed keys.
-
Interactive Queries (Kafka Streams). Expose RocksDB stores via
KafkaStreams#store(...)+ a thin REST layer. UseStreamsMetadatato route queries to the instance owning the key's partition. - GlobalKTable rule. Use only when the lookup table is < ~1 GB per instance. Above that, switch to a regular KTable with co-partitioning or to Flink's async lookup join.
Frequently asked questions
Is Kafka Streams a competitor to Apache Flink?
Not exactly — they target different deployment shapes. Kafka Streams is a JVM library you embed inside your microservice (no separate cluster), while Flink is a cluster runtime with a JobManager and TaskManagers. Kafka Streams shines on Kafka-only pipelines with partition-bounded parallelism and a microservice ops model; Flink shines on multi-source pipelines, per-operator parallelism, exactly-once to non-Kafka sinks (Iceberg, JDBC, S3), and multi-TB state. They overlap in the 20-30% middle where both engines work, and there the team picks based on what they already operate. Treating them as "the same kind of thing" is the most common kafka streams vs flink interview mistake.
What is the difference between KStream and KTable in Kafka Streams?
A KStream is an append-only record stream where every event matters — it's the unbounded event log view. A KTable is a changelog-backed materialised view where each (key, value) upserts the previous value for that key — it's the latest-value-per-key snapshot view. They are the same data viewed two ways (the stream-table duality): KTable.toStream() converts a table to its underlying changelog stream; KStream.groupByKey().aggregate(...) converts a stream to a table. The kafka streams ktable is internally backed by a local RocksDB state store with a changelog topic for durability. There's also a GlobalKTable, which replicates the whole table on every instance for star-schema joins against small lookup tables.
Does Kafka Streams support exactly-once streaming?
Yes — set processing.guarantee=exactly_once_v2 (exactly_once_v2 since Kafka Streams 2.5). The runtime uses a single transactional producer per stream thread (instead of per task in EOS-v1), dramatically reducing transaction overhead. Each commit atomically writes state-changelog records, output records, and advances consumer offsets — all-or-nothing across input + state + output. This gives end-to-end exactly-once streaming for Kafka-to-Kafka pipelines. The caveat: the EOS guarantee only covers Kafka sinks; writing to JDBC or Iceberg from Kafka Streams breaks the chain and forces you to either accept at-least-once + idempotent writes, or move the pipeline to Flink's TwoPhaseCommitSinkFunction.
When should I use Apache Flink over Kafka Streams?
Use Flink when any of the five push-to-Flink conditions hits: (1) multi-source pipelines (Kafka + Kinesis + JDBC CDC + S3 files); (2) exactly-once to non-Kafka sinks (Iceberg, Postgres, S3, ClickHouse); (3) per-operator parallelism that differs by 4x or more from source partition count (CPU-bound enrichment, ML scoring); (4) multi-TB keyed state with remote snapshot durability; (5) you already have a platform team operating one or more Flink clusters. Use Kafka Streams for the inverse: Kafka-only pipelines, EOS-v2 to Kafka, partition-bounded parallelism, microservice deployment, no platform team. The apache flink vs kafka question almost always reduces to "library or cluster?" — and that depends on your sources, sinks, and ops capacity.
How does Flink checkpointing differ from Kafka Streams changelog replication?
Both engines keep keyed state in RocksDB, but durability is achieved differently. Kafka Streams continuously replicates every state write to a co-located changelog topic in Kafka (one record per key, compacted by default). On task move, the new owner reads the changelog from offset 0 to rebuild RocksDB — fast for small state, slow for multi-GB state (mitigated by num.standby.replicas keeping warm copies). Flink uses flink checkpointing via Chandy-Lamport barriers — the JobManager periodically injects barriers into every source, each operator snapshots its state to remote storage (S3/HDFS) when the barrier arrives, and the checkpoint is complete when every operator acks. Snapshots are incremental and asynchronous by default; recovery loads the chain of incremental snapshots. The differences matter at scale: Kafka Streams ties durability to Kafka cluster capacity; Flink ties it to remote object store capacity (cheaper for multi-TB state).
Flink vs Spark Structured Streaming — when do I pick which?
Pick Flink for true streaming (record-at-a-time), sub-second latency requirements (10–500 ms), per-operator parallelism, and EOS to non-Kafka sinks via TwoPhaseCommit. Pick Spark Structured Streaming when you already have a Spark batch team, unified batch+streaming code is a real win (DataFrame API on both sides), 100 ms–1 s micro-batch latency is fine, and MLlib integration is already in production. The 2026 flink vs spark answer: Flink wins on latency and per-operator scaling; Spark wins on hiring pool and batch-streaming code unification. Continuous-mode Spark Structured Streaming has narrowed the latency gap, but it remains less mature than micro-batch and rarely the default deployment shape.
Practice on PipeCode
- Drill the streaming practice library → for the KStream / KTable / window / state family of probes.
- Rehearse on medium-difficulty streaming problems → when the interviewer wants engine-design depth.
- Sharpen Python streaming drills → for the watermark + window + aggregation patterns.
- Stack the real-time analytics library → for the end-to-end EOS + dashboarding probes.
- Layer the aggregation library → for the keyed-state + windowed-aggregate family.
- For the broader surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the Spark axis with the Apache Spark internals course →.
- For ETL system design, work through the ETL system design course →.
Pipecode.ai is Leetcode for Data Engineering — every Kafka Streams and Flink recipe above ships with hands-on practice rooms where you wire the KStream-KTable join, the operator-parallel JobGraph, and the TwoPhaseCommit sink against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your `kafka streams vs flink` answer holds up under a senior interviewer's depth probes.





Top comments (0)