apache kafka interview questions show up in nearly every senior data engineering loop because Kafka is the messaging substrate underneath modern streaming pipelines, change data capture, and real-time analytics. Interviewers don't stop at "what is Kafka?" — they probe whether you understand kafka partitions as the unit of parallelism, kafka isr as the durability gate, kafka consumer group rebalance as the operational pain point, and kafka exactly once semantics as the producer-to-consumer atomic contract.
This guide walks through the eight Kafka primitives that show up most often in data engineering interview questions at FAANG and streaming-heavy shops (Uber, Netflix, LinkedIn, Stripe, DoorDash). Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works. By the end you'll be able to defend acks=all with min.insync.replicas=2, trace a cooperative-sticky rebalance across four consumers, and explain transactional consume-transform-produce with isolation.level=read_committed — the exact shape kafka data engineer interview rounds reward when kraft kafka and kafka transactions come up.
When you want hands-on reps immediately after reading, browse streaming practice library →, drill real-time analytics drills →, and rehearse on Python streaming problems →.
On this page
- Why Kafka shows up in every senior data engineering interview
- Topics, partitions, and ordering — the unit of parallelism
- Replication, ISR, and the durability gate
- Producer semantics — acks, idempotence, batching, compression
- Consumer groups, offsets, and cooperative-sticky rebalancing
- Exactly-once semantics and Kafka transactions
- Kafka Connect, Schema Registry, and Kafka Streams — the ecosystem
- Choosing the right Kafka primitive (cheat sheet)
- Frequently asked questions
- Practice on PipeCode
1. Why Kafka shows up in every senior data engineering interview
Kafka is the durable log between every producer and every consumer in a modern pipeline
The one-sentence invariant: Kafka is a distributed, partitioned, replicated commit log — producers append records, consumers read them at their own pace, and the log persists for a configurable retention window so multiple consumers can replay it independently. Once you internalise that mental model — log first, queue and pub-sub as derived behaviours — the rest of the apache kafka interview questions family becomes a sequence of trade-off discussions: durability vs latency, ordering vs throughput, exactly-once vs raw speed.
Kafka vs traditional message queue in five bullets.
- Lifecycle. RabbitMQ / SQS delete a message after it is acknowledged. Kafka keeps the message for the topic's retention window (hours, days, or forever) — multiple consumer groups read the same log independently.
- Pull vs push. Kafka is pull-based — consumers fetch at their own pace. Traditional MQs push, which forces backpressure to be solved at the broker.
- Throughput. Kafka is built around sequential disk writes and zero-copy network transfers — it sustains millions of writes per second on commodity hardware.
- Replay. Resetting an offset is a first-class operation; replaying a day of events is one CLI command. RabbitMQ would require a re-publish.
- Fanout patterns. One topic, N consumer groups → each group sees every message. Same topic, one consumer group with M consumers → partitions are sharded across M readers (queue semantics on a log substrate).
The three primitives every Kafka interview opens with.
- Broker — a Kafka server. A cluster has 3+ brokers for production; one is elected controller for cluster-wide metadata operations.
- Topic — a logical name for a stream of records. A topic is a directory of one or more partitions on disk.
- Partition — an ordered, immutable, append-only log. Partitions are the unit of parallelism (consumer count is bounded by partition count) and the unit of ordering (records inside one partition are strictly ordered).
The 2026 reality (Kafka 4.x).
-
KRaft mode is the only mode. As of Kafka 4.0 ZooKeeper is fully removed. Metadata is managed by a Raft consensus quorum inside the brokers —
KAFKA_PROCESS_ROLES=broker,controllerfor combined nodes,KAFKA_CONTROLLER_QUORUM_VOTERSfor the voter set. - Cooperative-sticky rebalancing is the default consumer protocol. No more stop-the-world consumer pauses on every group change.
- Share Groups (Kafka 4.2) introduce true queue semantics — partitions can be read by multiple consumers concurrently, no per-partition ownership. Different problem, different tool — covered briefly in §5.
What interviewers listen for.
- Do you reach for
acks=all+min.insync.replicas=2the moment durability comes up? — senior signal. - Do you explain ordering as per-partition only, not topic-wide? — required answer.
- Do you mention idempotent producer and transactions when "exactly-once" is asked? — required answer.
- Do you reach for cooperative-sticky when explaining rebalancing? — current-default signal.
Worked example — design the ingest layer for a clickstream pipeline
Detailed explanation. Realistic Kafka design starts from the throughput target ("100k events/sec at peak"), the ordering requirement ("per-session ordering matters"), and the downstream consumer count ("two real-time consumers, one warehouse sink").
Question. Design the Kafka topic for a clickstream pipeline that ingests 100,000 events/sec at peak, requires per-session ordering, replicates across 3 brokers, and survives one broker failure without data loss. Sketch the CLI command, the producer config, and the partition-count math.
Code (CLI + producer config).
# Create the topic — 20 partitions, RF=3, min.insync.replicas=2
kafka-topics.sh --bootstrap-server kafka-1:9092 \
--create --topic clickstream \
--partitions 20 \
--replication-factor 3 \
--config min.insync.replicas=2 \
--config retention.ms=604800000 # 7 days
# producer.py (Python, confluent-kafka)
from confluent_kafka import Producer
producer = Producer({
"bootstrap.servers": "kafka-1:9092,kafka-2:9092,kafka-3:9092",
"acks": "all", # wait for full ISR
"enable.idempotence": True, # dedup producer retries
"compression.type": "zstd",
"linger.ms": 10, # batch for 10ms
"batch.size": 65536,
})
def publish(event):
key = str(event["session_id"]).encode() # session_id keeps order per session
producer.produce("clickstream", key=key, value=json.dumps(event).encode())
producer.flush()
Step-by-step explanation.
- Partition count math. Single consumer sustains ≈ 5,000 events/sec → 100,000 ÷ 5,000 = 20 partitions. Over-provision lightly so future growth doesn't force a partition-count change (which would re-shuffle every key's home).
- Replication factor 3 + min.insync.replicas 2. A write commits when 2 of the 3 replicas have it. Survives one broker loss without data loss; rejects writes if only 1 ISR remains (caller observes failure instead of silent loss).
-
Partition key =
session_id. Per-session ordering is preserved becausehash(session_id) % 20routes all events for one session to the same partition. -
acks=all+enable.idempotence=true. Strongest durability with retry safety — duplicate appends on retry are deduplicated by the broker using the producer's PID + sequence number. -
linger.ms=10+batch.size=64KB. Producer batches up to 10ms or 64KB per partition before sending — trades 10ms of producer latency for ~5–10× throughput.
Output.
| Decision | Value | Why |
|---|---|---|
| Partitions | 20 | 100k/sec ÷ 5k per consumer |
| Replication factor | 3 | survive one broker loss |
| min.insync.replicas | 2 | block writes when < 2 ISR |
| Partition key | session_id | per-session ordering |
| acks | all | full-ISR durability |
| Idempotence | on | dedup producer retries |
| Compression | zstd | ~3× wire & disk savings |
Rule of thumb. Start at partitions ≈ peak_throughput / single_consumer_throughput, then over-provision 2–3× for growth. Don't go wild — every partition costs file handles, controller metadata, and rebalance time.
Kafka interview question on ingest sizing
A senior interviewer often shapes this round as: "Walk me through how you'd size the topic, the producer, and the durability config for a clickstream feed at 100k events/sec." It blends partition-count math, replication semantics, and producer tuning — the three muscles every kafka data engineer interview opens with.
Solution Using acks=all + min.insync.replicas=2 + 20 partitions
# Create the topic
kafka-topics.sh --bootstrap-server kafka-1:9092 \
--create --topic clickstream \
--partitions 20 --replication-factor 3 \
--config min.insync.replicas=2 --config retention.ms=604800000
Step-by-step trace.
| Event | Producer action | Broker action | ISR after |
|---|---|---|---|
| session=42, click=home | hash(42)%20 = P7 → send to P7 leader (Broker 2), acks=all | Broker 2 writes, replicates to Broker 3, Broker 1 — all 3 acks come back | {1,2,3} |
| session=42, click=cart | same partition P7 → leader Broker 2 | leader writes, replicates | {1,2,3} |
| Broker 3 network blip — falls behind by replica.lag.time.max.ms | producer continues with acks=all | Broker 3 removed from ISR | {1,2} |
| session=42, click=checkout | same partition P7 | leader Broker 2 + follower Broker 1 ack → commit (min.insync.replicas=2 satisfied) | {1,2} |
| Broker 1 also drops out | producer attempts send | leader Broker 2 alone → < min.insync.replicas → broker rejects write with NotEnoughReplicasException → producer surfaces error to caller |
{2} |
Output:
| Metric | Steady state | After Broker 3 dropped | After Broker 1 also dropped |
|---|---|---|---|
| Writable | yes | yes | no (write rejected) |
| Reads up to HW | latest | latest | latest |
| Data loss risk | none | none | none — writes rejected, no silent loss |
Why this works — concept by concept:
-
Partition as unit of parallelism — 20 partitions = up to 20 consumer threads doing parallel reads of
clickstream. Adding a 21st consumer wouldn't help. -
Partition key for per-session ordering —
hash(session_id) % 20keeps every event for one session on the same partition, where order is strictly preserved. -
acks=all — the producer waits for every ISR member to acknowledge the write before considering it committed. Combined with
min.insync.replicas=2, this guarantees that any acknowledged write survives at least one broker loss. - min.insync.replicas as a durability gate — if the ISR shrinks below the threshold, the broker prefers to reject writes (surfacing the error upstream) rather than silently commit something that could be lost.
-
Idempotent producer —
enable.idempotence=truetags each produce request with a PID + sequence number; the broker rejects duplicates from retries, so the producer can retry safely without creating duplicates downstream. - Cost — write latency = O(max ISR replica RTT); cluster throughput = O(brokers × disk bandwidth × partition spread); storage = O(events × retention_ms × replication_factor).
Streaming
Topic — streaming pipelines
Streaming pipeline problems (Kafka, partitions, ordering)
2. Topics, partitions, and ordering — the unit of parallelism
kafka partitions are the only thing Kafka guarantees ordering on — the rest is a routing decision
The mental model in one line: a topic is a logical name; a partition is the physical, ordered, append-only log; the partition key decides which partition every record lands on. Once you say "ordering is per-partition, never topic-wide," half of the kafka partitions consumer groups interview question family collapses to obvious answers.
The partition decisions every interview probes.
- Partition count. Upper bound on consumer parallelism, lower bound on file-handle and metadata overhead. Typical range: 6–100 per topic; rarely 1, rarely 1,000.
-
Partition key. Default partitioner is
hash(key) % numPartitions. Same key → same partition. No key → sticky round-robin (favours batching). -
Hot-key risk. One key generates 80% of the traffic → that partition is a bottleneck. Mitigations: composite key (
session_id + bucket), custom partitioner, salting. - Re-partitioning is expensive. Changing partition count rehomes every key; for ordered topics this is a major operation. Plan partition count up-front.
Three ordering questions interviewers always ask.
- "How do I guarantee global ordering across a topic?" — One partition. Trade-off: throughput caps at single-partition speed.
- "How do I preserve per-key ordering?" — Use the key as the partition key. All records with that key route to one partition.
- "How do I handle hot keys?" — Composite key, salting, or custom partitioner that distributes the hot tenant across N partitions and reassembles downstream.
Worked example — choose partition key for an orders topic
Detailed explanation. Order events for the same customer must be processed in sequence; orders for different customers can flow in parallel. The natural key is customer_id. With 12 partitions, every event for one customer routes to exactly one partition, so the consumer for that partition sees them in produce order.
Question. Three orders arrive for customer_id=42. With 12 partitions and default partitioner, on which partition do they land, and what does the consumer for that partition observe?
Code (Java-style pseudocode).
def partition_for(key, num_partitions):
return hash(key) & 0x7fffffff % num_partitions
partition_for("42", 12) # → P3 (illustrative)
producer.produce("orders", key=b"42", value=b'{"order_id":1001,"status":"placed"}')
producer.produce("orders", key=b"42", value=b'{"order_id":1001,"status":"paid"}')
producer.produce("orders", key=b"42", value=b'{"order_id":1001,"status":"shipped"}')
Step-by-step explanation.
- The producer hashes the key
"42"once per record. Result is deterministic — same key → same partition. - All three records route to partition P3.
- The consumer assigned to P3 reads them in offset order, which is the produce order: placed → paid → shipped.
- Consumers for other partitions never see these records — they don't need to, because per-customer ordering is local to one partition.
Output.
| order_id | status | partition | offset |
|---|---|---|---|
| 1001 | placed | P3 | 4711 |
| 1001 | paid | P3 | 4712 |
| 1001 | shipped | P3 | 4713 |
Rule of thumb. When in doubt about ordering, ask the interviewer: "Is per-key ordering enough, or do we need global ordering?" The answer is almost always per-key.
Kafka interview question on partition key choice
The probe usually sounds like: "How would you choose the partition key for a payment event stream where the same merchant_id may issue thousands of events per second?" — testing whether the candidate recognises a hot-key situation and reaches for a composite key or salting.
Solution Using composite key merchant_id + bucket
import hashlib, json
NUM_BUCKETS = 16 # increases parallelism for hot merchants
def composite_key(merchant_id, payment_id):
bucket = int(hashlib.md5(payment_id.encode()).hexdigest(), 16) % NUM_BUCKETS
return f"{merchant_id}:{bucket}".encode()
for event in payment_stream:
producer.produce(
"payments",
key=composite_key(event["merchant_id"], event["payment_id"]),
value=json.dumps(event).encode(),
)
Step-by-step trace.
| Event | merchant_id | payment_id | bucket | composite key | partition |
|---|---|---|---|---|---|
| 1 | 7 | pay_001 | 4 | "7:4" | P5 |
| 2 | 7 | pay_002 | 11 | "7:11" | P9 |
| 3 | 7 | pay_003 | 4 | "7:4" | P5 |
| 4 | 7 | pay_004 | 0 | "7:0" | P2 |
Output:
| Partition | Count of merchant=7 events |
|---|---|
| P2 | 1 |
| P5 | 2 |
| P9 | 1 |
Compared to a naive key=merchant_id strategy, where all four events would have landed on a single partition, the load is now spread across three partitions while still preserving ordering per (merchant_id, bucket) slice.
Why this works — concept by concept:
-
Hot-key dilution — adding a bucket suffix to the key splits one merchant's traffic across
NUM_BUCKETSpartitions instead of one. The hottest partition's load drops by ≈1 / NUM_BUCKETS. -
Local ordering preserved — events with the same
payment_idalways hash to the same bucket, so per-payment ordering is still strict. -
Downstream re-assembly — when a consumer needs all events for
merchant_id=7, it reads from all partitions and groups bymerchant_id. Slightly more code, vastly more headroom. -
Cost — partition spread = O(
NUM_BUCKETS); ordering scope shrinks from "per merchant" to "per(merchant_id, bucket)" — usually an acceptable trade-off.
Streaming
Topic — real-time analytics
Real-time analytics problems (hot keys, partitioning)
3. Replication, ISR, and the durability gate
kafka isr is the contract — only in-sync replicas can become leaders and only ISR-acked writes are committed
For each partition Kafka stores R copies — one leader, R−1 followers. A follower is in-sync if it has caught up with the leader within replica.lag.time.max.ms (default 30 seconds). The In-Sync Replicas (ISR) is the set of replicas eligible to become leader if the current leader fails, and the set whose acknowledgements acks=all waits for.
Replication concepts in five bullets.
- Replication factor (RF). Number of copies per partition. Typical production value: 3.
- Leader / follower. Producers and consumers only talk to the leader. Followers replicate the log asynchronously, ack to the leader once they've persisted each record.
- High Watermark (HW). The highest offset replicated to all ISR members. Consumers can only read up to HW — guaranteeing they only see committed data.
- Log End Offset (LEO). The highest offset written to a specific replica (may be > HW for the leader before followers catch up).
-
min.insync.replicas. Lower bound on the ISR size foracks=allwrites to succeed. With RF=3 andmin.insync.replicas=2, a single broker outage is tolerated; two outages reject new writes (visible failure beats silent loss).
When does a replica leave the ISR?
-
Falls behind. Lags the leader by more than
replica.lag.time.max.msof replication. - Network partition or broker crash. The follower stops fetching for a sustained period.
- Manual reassignment. Administrative operation moves the replica off the broker.
Unclean leader election (the dangerous knob).
-
unclean.leader.election.enable=false(default since Kafka 0.11) — Kafka refuses to elect a non-ISR replica even if the leader is dead and no ISR is left. Prefers downtime over silent loss. - Set it to
trueonly when availability strictly trumps durability (non-critical telemetry, idempotent downstream sinks). For ledgers, payments, anything compliance-bound — leave it off.
Worked example — what happens when an ISR shrinks below min.insync.replicas
Detailed explanation. A common interview probe is: "Broker B3 died — what happens to writes?" The answer depends on the original ISR, acks, and min.insync.replicas.
Question. A topic has RF=3 with brokers {B1, B2, B3}, min.insync.replicas=2, and acks=all. Initially ISR = {B1, B2, B3} and B1 is the leader. B3 crashes, then 40 seconds later B2 also experiences a network blip. What does each producer write observe?
Input (timeline).
| t (s) | Event | ISR | Leader |
|---|---|---|---|
| 0 | All healthy | {1,2,3} | B1 |
| 5 | B3 crashes | {1,2,3} → {1,2} after replica.lag.time.max.ms
|
B1 |
| 35 | ISR collapses to {1,2} | {1,2} | B1 |
| 70 | B2 network blip | {1,2} → {1} | B1 |
Code (producer behaviour).
producer = Producer({
"bootstrap.servers": "B1:9092,B2:9092,B3:9092",
"acks": "all",
"enable.idempotence": True,
"delivery.timeout.ms": 120000,
})
# At t=10 — write succeeds
producer.produce("orders", key=b"42", value=b'{"o":1}') # ✓
# At t=40 (ISR={1,2}, still ≥ min.insync.replicas) — write succeeds
producer.produce("orders", key=b"42", value=b'{"o":2}') # ✓
# At t=75 (ISR={1}, below min.insync.replicas=2) — write FAILS
producer.produce("orders", key=b"42", value=b'{"o":3}') # ✗ NotEnoughReplicasException
Step-by-step explanation.
-
t=10 — ISR is full.
acks=allwaits for B1 + B2 + B3 ack; all three deliver; offset is committed; HW advances. -
t=40 — B3 has dropped out of ISR, ISR is now
{B1, B2}.acks=allwaits for B1 + B2 ack — both deliver. ISR size (2) ≥min.insync.replicas(2), so the write commits. -
t=75 — B2 has also dropped. ISR is now
{B1}. ISR size (1) <min.insync.replicas(2). Broker rejects the produce request withNotEnoughReplicasException. Producer surfaces the error to the caller; the application must handle it (retry with backoff, alert, dead-letter, etc.). - Why is this the correct behaviour? Silent acceptance with only one replica would mean a subsequent B1 crash erases the write. The system prefers a visible failure to a hidden loss.
Output.
| Write | t (s) | Result | Reason |
|---|---|---|---|
{"o":1} |
10 | committed | ISR=3, full ack |
{"o":2} |
40 | committed | ISR=2 = min.insync.replicas |
{"o":3} |
75 | rejected | ISR=1 < min.insync.replicas |
Rule of thumb. With RF=3, set min.insync.replicas=2 and acks=all. You trade some availability (writes can be rejected when 2 brokers are simultaneously unhealthy) for the guarantee that any acknowledged write survives a single broker loss.
Kafka interview question on durability tuning
A senior interviewer might frame this as: "Your team wants acks=all + min.insync.replicas=2 to satisfy the ledger SLO, but pages every time a single broker is replaced. How would you reason about availability vs durability here?"
Solution Using durability-first config with operational guardrails
kafka-topics.sh --bootstrap-server kafka-1:9092 \
--alter --topic ledger \
--config min.insync.replicas=2 \
--config unclean.leader.election.enable=false
producer = Producer({
"bootstrap.servers": "kafka-1:9092,kafka-2:9092,kafka-3:9092",
"acks": "all",
"enable.idempotence": True,
"retries": 2147483647,
"max.in.flight.requests.per.connection": 5,
"delivery.timeout.ms": 120000,
})
Step-by-step trace.
| Step | Action | Effect |
|---|---|---|
| 1 | Schedule rolling broker replacements one at a time | ISR briefly drops to {2}, then re-grows to {3} as replica catches up |
| 2 |
min.insync.replicas=2 + acks=all
|
accepts writes throughout the single-broker swap |
| 3 |
enable.idempotence=true + high retries
|
producer transparently retries transient errors without creating duplicates |
| 4 | unclean.leader.election=false |
refuses to promote a non-ISR replica during the swap — no silent loss |
Output:
| Operation | Producer-side observation | Cluster-side state |
|---|---|---|
| Steady state | writes succeed | ISR = {1, 2, 3} |
| Broker B3 replaced | writes succeed | ISR = {1, 2} for ~30s, then {1, 2, 3} |
| Two brokers down (rare) | writes throw NotEnoughReplicasException
|
ISR = {1} — broker prefers reject over silent loss |
| Single broker permanent loss | writes succeed | ISR = {1, 2}; replacement rebuilds the third replica |
Why this works — concept by concept:
-
ISR as the durability set — only ISR members can become leader, so any
acks=all-committed write survives the loss of any single non-leader ISR member. - min.insync.replicas as the lower bound — by refusing writes when ISR drops to 1, the broker eliminates the window in which a single subsequent failure could erase a freshly-committed record.
-
Idempotent producer — combined with large
retries, transient broker errors (leader election, ISR shuffle) heal automatically without duplicate appends. - unclean.leader.election=false — refuses to elect a divergent replica as leader, which would otherwise serve as a backdoor for data loss.
- Cost — write latency = O(slowest ISR replica RTT); rejected-write rate during incidents ≈ small but visible — a deliberate trade for "no silent loss."
Streaming
Topic — streaming pipelines
Replication and durability problems
4. Producer semantics — acks, idempotence, batching, compression
Producer config is a four-axis trade-off — durability, latency, throughput, and dedup
The producer config is where senior kafka data engineer interview rounds spend the most time, because every knob has a sharp trade-off. Once you can articulate the relationship between acks, enable.idempotence, linger.ms/batch.size, and compression.type, you can defend any production producer profile.
The four axes.
-
acks— durability.-
acks=0— fire-and-forget. Lowest latency, weakest durability. Use only for telemetry where loss is acceptable. -
acks=1— leader ack. Loses any write where the leader crashed before followers replicated. -
acks=all— full ISR ack. The default for production data.
-
-
enable.idempotence— duplicates.-
true(default in modern clients) — the broker tags each record with PID + sequence number; retries don't create duplicates. -
false— retries CAN double-append. Almost never the right answer.
-
-
Batching — throughput vs latency.
-
batch.size(bytes per partition batch) andlinger.ms(max wait time before send). - Bigger batches → higher throughput, more producer-side memory, higher latency.
-
-
Compression — bandwidth and disk.
-
compression.type=snappy— balanced. Default-good. -
zstd— best ratio, more CPU. -
lz4— fastest decompress. -
gzip— densest historically, more CPU.
-
The five most-asked producer interview questions.
- "What's the difference between
acks=1andacks=all?" — leader-only vs full-ISR durability. - "How does idempotence prevent duplicates?" — PID + sequence number deduplication at the broker.
- "What's the trade-off of increasing
batch.size?" — throughput up, latency up, memory up. - "When is
compression.type=zstdworth it?" — when bandwidth or disk dominates and CPU has headroom. - "What's a
QueueFullException?" — the producer accumulator filled because the broker can't keep up; back off or scale brokers.
Worked example — producer ships duplicate records, debug it
Detailed explanation. A team reports duplicate orders in the downstream warehouse. The producer is sending to Kafka with retries on and idempotence off — a classic interview trap.
Question. Given a producer with acks=all, retries=10, enable.idempotence=false, and max.in.flight.requests.per.connection=5, explain how a transient network error can produce duplicate records and how to fix it.
Code (the broken producer).
producer = Producer({
"bootstrap.servers": "kafka-1:9092",
"acks": "all",
"retries": 10,
"enable.idempotence": False, # ⚠️ trap
"max.in.flight.requests.per.connection": 5,
})
producer.produce("orders", key=b"42", value=b'{"order_id":1001}')
producer.flush()
Step-by-step explanation.
- The producer sends record R1 to the leader.
- The leader writes R1, replicates to ISR, and sends an ack back.
- The ack is lost in the network (TCP RST, broker restart at the wrong moment).
- The producer doesn't see the ack, so it retries with a fresh record. The broker sees this as a new record — there's no PID/sequence number to dedupe on.
- R1 is now committed twice. Two records, same payload, different offsets.
Output (before fix).
| Offset | Record | Notes |
|---|---|---|
| 4711 | {"order_id":1001} | original send |
| 4712 | {"order_id":1001} | duplicate from retry |
The fix.
producer = Producer({
"bootstrap.servers": "kafka-1:9092",
"acks": "all",
"enable.idempotence": True, # ✓ dedup retries
"max.in.flight.requests.per.connection": 5,
"retries": 2147483647,
})
Output (after fix).
| Offset | Record | Notes |
|---|---|---|
| 4711 | {"order_id":1001} | only one append — broker detected duplicate PID+seq |
Rule of thumb. enable.idempotence=true should be the default for every production producer. The cost is negligible (a producer ID handshake at startup); the safety is large.
Kafka interview question on tuning producer throughput
A scaling-focused round often asks: "Throughput is at 60% of broker capacity but producers complain about latency. How would you tune?"
Solution Using linger.ms + batch.size + compression.type=zstd
producer = Producer({
"bootstrap.servers": "kafka-1:9092,kafka-2:9092,kafka-3:9092",
"acks": "all",
"enable.idempotence": True,
"compression.type": "zstd",
"linger.ms": 20, # batch up to 20ms
"batch.size": 131072, # 128KB per-partition batch
"buffer.memory": 67108864, # 64MB accumulator
})
Step-by-step trace.
| Step | Knob | Effect |
|---|---|---|
| 1 | linger.ms=20 |
producer waits up to 20ms after the first record before sending the batch |
| 2 | batch.size=128KB |
once the batch reaches 128KB, send immediately even if linger.ms hasn't elapsed |
| 3 | compression.type=zstd |
the entire batch is compressed once; broker stores it compressed; consumers decompress on read |
| 4 | acks=all |
latency = max RTT to the slowest ISR replica; for the batch as a whole, not per-record |
| 5 | enable.idempotence=true |
PID + sequence number per partition prevent duplicates on retry |
Output:
| Metric | Before | After |
|---|---|---|
| Producer throughput | 60% capacity | 95%+ capacity |
| Per-record latency p50 | 8 ms | 24 ms |
| Per-record latency p99 | 80 ms | 120 ms |
| Wire bytes | 1.0× | ~0.35× (zstd) |
Why this works — concept by concept:
-
Batching —
linger.ms+batch.sizegroup records destined for the same partition into one network call. Fewer round-trips per record → broker handles more work per CPU cycle. - Compression — zstd compresses the whole batch once; producer pays a small CPU cost, network bytes drop 3×+, broker disk usage drops correspondingly.
-
Idempotent producer — preserves at-most-once-per-record semantics under retry; required to safely run with
retriescranked up. - acks=all — committed bytes must reach the full ISR; not loosened to gain the throughput.
- Cost — latency rises ~3× at p50; throughput rises ~60%. Acceptable when bytes/sec is the bottleneck, not p99 latency.
Streaming
Topic — streaming pipelines
Producer tuning and dedup problems
5. Consumer groups, offsets, and cooperative-sticky rebalancing
kafka consumer group rebalance is the operational story senior interviewers probe most
A consumer group is a set of consumers that share the partitions of a topic. Kafka assigns each partition to exactly one consumer in the group, so the maximum parallelism equals the partition count. Two consumer groups reading the same topic each get the full message stream independently — that's how Kafka delivers both queue and pub-sub semantics from one substrate.
The four concepts every consumer-group interview probes.
-
Offset. Position in a partition. Stored in the internal
__consumer_offsetstopic, partitioned by group id. -
Heartbeat + session.timeout.ms. Consumer sends heartbeats to the group coordinator; missing them for
session.timeout.ms(default 45s in Kafka 4.x) marks the consumer dead. -
max.poll.interval.ms. Covers processing time between polls. If
poll()isn't called within this window (default 5 min), the consumer is considered stuck and is kicked out. - Rebalance. Triggered by consumer join/leave or topic metadata change. The group coordinator redistributes partitions.
Assignment strategies.
- RangeAssignor — contiguous ranges per consumer.
- RoundRobinAssignor — even distribution across all consumers.
- StickyAssignor — even + minimal movement.
- CooperativeStickyAssignor (default, Kafka 4.x) — sticky + only the affected partitions revoke during rebalance (no stop-the-world).
Manual offset commit (the durability story).
-
enable.auto.commit=true— periodic commits in the background. Easy, but at-least-once is harder to reason about. -
enable.auto.commit=false+consumer.commit()after processing succeeds — the common production choice. Commit only after the side effect (warehouse insert, downstream produce, etc.) completes.
Worked example — trace a cooperative-sticky rebalance
Detailed explanation. Four consumers (C1..C4) share an 8-partition topic. C3 dies. The default cooperative-sticky assignor moves only C3's partitions, leaving the others untouched — every other consumer keeps processing.
Question. Group has 4 consumers and the topic has 8 partitions. Walk the assignment before and after C3 dies under cooperative-sticky.
Input (before).
| Consumer | Partitions |
|---|---|
| C1 | P0, P1 |
| C2 | P2, P3 |
| C3 | P4, P5 |
| C4 | P6, P7 |
Code (consumer config).
consumer = Consumer({
"bootstrap.servers": "kafka-1:9092",
"group.id": "order-pipeline",
"enable.auto.commit": False,
"isolation.level": "read_committed",
"partition.assignment.strategy": "cooperative-sticky",
"session.timeout.ms": 45000,
"max.poll.interval.ms": 300000,
})
consumer.subscribe(["orders"])
Step-by-step explanation.
- t=0 — Steady state. Each consumer polls every 1–2 seconds; heartbeats arrive every 3 seconds.
-
t=10 —
C3crashes (process killed). Heartbeats stop. -
t=10 + session.timeout.ms — Group coordinator marks
C3as dead and triggers a rebalance. -
Cooperative-sticky decision —
C1, C2, C4do not revoke their partitions; onlyC3's partitions need a new home. The coordinator redistributes{P4, P5}to the surviving consumers (one each). -
t=10 + ~5s —
C1is now assigned{P0, P1, P4}andC2is now assigned{P2, P3, P5}.C4stays at{P6, P7}— completely unaffected. - t=10 + ~5s — All three surviving consumers continue polling; total downtime ≈ a few seconds for the affected partitions only.
Output.
| Consumer | Partitions before | Partitions after |
|---|---|---|
| C1 | P0, P1 | P0, P1, P4 |
| C2 | P2, P3 | P2, P3, P5 |
| C3 | P4, P5 | (dead) |
| C4 | P6, P7 | P6, P7 (untouched) |
Rule of thumb. Under cooperative-sticky, a single-consumer failure only pauses the 2 partitions the dead consumer owned, for a few seconds. Under the older Eager protocol, all 8 partitions would have paused for the rebalance — multi-second stop-the-world on the whole group.
Kafka interview question on at-least-once consumer
A common probe: "Walk me through an at-least-once consumer that writes to BigQuery. Where does the commit go?"
Solution Using manual commitSync() after successful write
consumer = Consumer({
"bootstrap.servers": "kafka-1:9092",
"group.id": "orders-to-bq",
"enable.auto.commit": False,
"auto.offset.reset": "earliest",
"isolation.level": "read_committed",
})
consumer.subscribe(["orders"])
BATCH = []
BATCH_SIZE = 500
while True:
msg = consumer.poll(1.0)
if msg is None or msg.error():
continue
BATCH.append(json.loads(msg.value()))
if len(BATCH) >= BATCH_SIZE:
write_to_bigquery(BATCH) # ① side effect first
consumer.commit(asynchronous=False) # ② then commit offsets
BATCH.clear()
Step-by-step trace.
| Event | Action | Committed offset | Notes |
|---|---|---|---|
| poll batch 1 (500 records) | write to BigQuery | — | BQ write succeeds |
| commit | commitSync | offset advanced by 500 | safe checkpoint |
| poll batch 2, BQ write fails | (no commit) | unchanged | retry on next poll cycle |
| poll batch 2 again (BQ healthy) | write to BigQuery | — | BQ write succeeds |
| commit | commitSync | offset advanced by 500 | next 500 records check-pointed |
| consumer crash mid-batch | (no commit) | unchanged | restart re-reads from last commit |
Output:
| Metric | Value |
|---|---|
| Delivery semantics | at-least-once (duplicates possible if BQ retry resends) |
| Required BQ guard | idempotent UPSERT on order_id |
| Pause-on-failure behaviour | yes — consumer retries the same batch until BQ accepts |
Why this works — concept by concept:
- Process-then-commit ordering — committing after the side effect succeeds means a crash between the two operations re-delivers the batch on restart. No data loss.
-
Idempotent sink — BigQuery
MERGEkeyed onorder_iddeduplicates any re-delivered batch. The pipeline is end-to-end at-least-once with the warehouse handling the dedup. - cooperative-sticky — a single consumer crash only re-distributes its partitions; the rest of the group keeps moving.
- isolation.level=read_committed — if upstream producers use transactions, the consumer skips uncommitted records (no half-written reads).
-
Cost — checkpoint frequency = 500 records / batch → bounded re-processing on crash; commit latency on
commitSyncis one broker RTT.
Streaming
Topic — streaming (Python)
Consumer group rebalance + offset problems
6. Exactly-once semantics and Kafka transactions
kafka exactly once semantics is idempotent producer + transactions + read_committed — all three
Exactly-once isn't magic — it's the composition of three orthogonal features. The senior apache kafka interview questions litmus test is "can you name all three and explain what each one solves?"
The three components of EOS.
-
Idempotent producer (
enable.idempotence=true) — eliminates producer-side duplicates from retries. -
Transactions (
transactional.id=…) — atomically commit writes to multiple partitions plus consumer offsets in one all-or-nothing operation. -
isolation.level=read_committedon consumers — only see records from committed transactions; skip the in-flight ones.
The transactions API in five calls.
-
producer.init_transactions()— register thetransactional.idwith the transactional coordinator (a broker). -
producer.begin_transaction()— open a new transaction. -
producer.produce(...)— buffer writes to the participating partitions. -
producer.send_offsets_to_transaction(...)— pass the consumer offsets into the transaction. -
producer.commit_transaction()(orabort_transaction()) — atomically commit (or roll back) every write plus the offset advance.
Zombie fencing.
- Each
transactional.idhas a producer epoch. When a new instance starts with the sametransactional.id, the coordinator increments the epoch and the previous instance's writes are rejected. Even if a stuck old worker comes back, it cannot produce duplicates.
Where EOS doesn't extend.
- External sinks (BigQuery, S3, Snowflake) are outside the Kafka boundary. EOS guarantees end at the output Kafka topic.
- For end-to-end exactly-once into an external warehouse, you still need an idempotent UPSERT keyed on a business id.
Worked example — transactional consume-transform-produce
Detailed explanation. An ETL worker reads from raw-events, enriches each record, and writes to enriched-events. Without EOS, a crash between the produce and the offset commit would re-deliver the same record after restart and the enriched output topic would contain duplicates. With EOS, the produce and the offset advance commit atomically.
Question. Three input records (k=42, v={"a":1}), (k=42, v={"a":2}), (k=42, v={"a":3}) arrive on raw-events. Walk through a transactional consume-transform-produce loop showing what lands on enriched-events, what is recorded in __consumer_offsets, and what is recorded in __transaction_state.
Code.
from confluent_kafka import Consumer, Producer
import json
consumer = Consumer({
"bootstrap.servers": "kafka-1:9092",
"group.id": "etl-transformer",
"enable.auto.commit": False,
"isolation.level": "read_committed",
})
producer = Producer({
"bootstrap.servers": "kafka-1:9092",
"transactional.id": "etl-transformer-001",
"acks": "all",
"enable.idempotence": True,
})
producer.init_transactions()
consumer.subscribe(["raw-events"])
while True:
msg = consumer.poll(1.0)
if msg is None or msg.error():
continue
producer.begin_transaction()
try:
raw = json.loads(msg.value())
enriched = {**raw, "enriched": True, "ts": "2026-05-27T00:00:00Z"}
producer.produce(
"enriched-events",
key=msg.key(),
value=json.dumps(enriched).encode(),
)
producer.send_offsets_to_transaction(
consumer.position(consumer.assignment()),
consumer.consumer_group_metadata(),
)
producer.commit_transaction()
except Exception:
producer.abort_transaction()
Step-by-step trace.
| Msg | Step |
enriched-events (P0) |
__consumer_offsets (etl-transformer) |
__transaction_state |
|---|---|---|---|---|
| 1 | begin txn | — | — | txn=etl-001-7, status=PREPARED |
| 1 | produce + sendOffsets |
{"a":1,"enriched":true} (uncommitted marker) |
offset=101 (uncommitted) | (still PREPARED) |
| 1 | commit txn | mark COMMITTED | mark COMMITTED | txn=etl-001-7, status=COMMITTED |
| 2 | full loop |
{"a":2,"enriched":true} COMMITTED |
offset=102 COMMITTED | txn=etl-001-8, status=COMMITTED |
| 3 | full loop |
{"a":3,"enriched":true} COMMITTED |
offset=103 COMMITTED | txn=etl-001-9, status=COMMITTED |
Output:
| Topic / Internal | Final state |
|---|---|
raw-events (input) |
unchanged — 3 records |
enriched-events (output) |
3 COMMITTED records, each with "enriched":true
|
__consumer_offsets |
etl-transformer.raw-events.P0 → offset 103 |
__transaction_state |
three COMMITTED markers (epoch 7, 8, 9) |
Why this works — concept by concept:
-
Idempotent producer — PID + sequence number means a retry of the same
produceis deduplicated at the broker. No duplicate appends from network blips. -
Atomic transaction — the
producetoenriched-eventsand the offset commit to__consumer_offsetssucceed together or roll back together. There is no window where the output exists but the offset is unmoved (which would cause re-delivery on restart and duplicates downstream). -
read_committed consumers — anyone downstream of
enriched-eventsonly sees committed records, not the in-flight ones. The transaction marker is what flips them from "visible" to "skip." -
Producer epoch / zombie fencing — if the worker crashes and a replacement starts with the same
transactional.id, the coordinator bumps the epoch and the old worker's pending writes are rejected. No "I came back to life" duplicates. - Cost — one transactional coordinator RTT per transaction; throughput remains high if each transaction batches many records, which the loop above does implicitly via poll batching.
Streaming
Topic — streaming (Python)
Exactly-once transactional pipelines (Python)
7. Kafka Connect, Schema Registry, and Kafka Streams — the ecosystem
The integration surface — Connect for source/sink, Schema Registry for evolution, Streams for in-Kafka processing
Most Kafka interview rounds end with one ecosystem question — "When would you reach for Kafka Connect vs writing a custom consumer?" or "Streams or Flink?" The right answer is a one-line decision tree.
Kafka Connect (read this section as a deferral pointer too).
- A standalone framework + cluster for moving data between Kafka and external systems.
- Source connectors — JDBC (PostgreSQL, MySQL), Debezium (log-based CDC), MongoDB, S3 (file).
- Sink connectors — BigQuery, Snowflake, S3, Elasticsearch.
- Single Message Transforms (SMTs) for inline routing, casting, masking.
- Detailed Debezium CDC patterns are covered in the dedicated CDC for Data Engineering Interviews guide — here we use Connect as the substrate only.
Schema Registry.
- Stores Avro / JSON Schema / Protobuf schemas keyed by
subject(typicallytopic-valueortopic-key). - Producer attaches a 4-byte schema id to each message; consumer fetches the schema by id.
-
Compatibility modes —
BACKWARD(consumers handle new + old; default),FORWARD,FULL,NONE. PickBACKWARDfor log-style topics where consumers upgrade after producers.
Kafka Streams.
- JVM library (Java / Scala / Kotlin) for stateful stream processing on top of Kafka.
- Stateful operators —
count,reduce,aggregate, windowed joins. - KStream (stream of changes) vs KTable (latest value per key) duality.
- Runs inside your application — no extra cluster.
- For richer event-time semantics, watermarks, and out-of-order handling at scale, reach for Apache Flink (covered in the dedicated Flink interview guide).
ksqlDB.
- A standalone server that exposes Kafka Streams primitives as SQL. Same engine, easier surface for SQL-only teams.
Worked example — choose Connect vs custom consumer for a sink
Detailed explanation. A team needs to land every event from enriched-events into Snowflake. Options: write a custom consumer that batches and uses Snowflake's COPY, or run the Confluent Snowflake Sink Connector.
Question. Compare a custom Python consumer to Kafka Connect for a sink to Snowflake on five axes: code, offset management, retries, schema evolution, observability.
Code (the Connect config — no custom code).
{
"name": "enriched-to-snowflake",
"config": {
"connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
"topics": "enriched-events",
"snowflake.url.name": "myorg-myaccount.snowflakecomputing.com",
"snowflake.user.name": "kafka_loader",
"snowflake.private.key": "${file:/secrets/sf.key:key}",
"snowflake.database.name": "PROD",
"snowflake.schema.name": "RAW",
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"tasks.max": "4",
"buffer.flush.time": "60",
"buffer.count.records": "10000",
"buffer.size.bytes": "100000000"
}
}
Step-by-step explanation.
- The connector is a long-running task inside the Connect worker cluster.
- It consumes from
enriched-eventsvia Kafka's normal consumer-group mechanism (group id is the connector name). - Records are buffered up to
buffer.count.records=10000orbuffer.flush.time=60s, whichever comes first. - Buffered records are written to a Snowflake-managed internal stage, then
COPYd into the target table. - Connect manages offsets — commits only after each
COPYcompletes successfully. - Schema evolution: when a producer publishes a new schema id, the connector fetches it from the Schema Registry and propagates the column changes (additive) to Snowflake.
Output (comparison).
| Axis | Custom consumer | Kafka Connect |
|---|---|---|
| Code | several hundred lines of Python | zero, just JSON config |
| Offset management | manual commitSync
|
built-in |
| Retries / dead-letter | manual | built-in DLQ topic |
| Schema evolution | manual | automatic via Schema Registry |
| Observability | custom metrics | JMX + REST metrics out of the box |
Rule of thumb. Reach for Kafka Connect first for source-to-Kafka and Kafka-to-sink. Write custom consumer code only when you need something Connect's connectors don't cover — bespoke business logic, exotic protocols, or unusual transaction boundaries.
Kafka interview question on choosing Streams vs Flink
A senior probe: "We need windowed rolling counts of clickstream events per user, joined against a user-profile changelog. Streams or Flink?"
Solution Using Kafka Streams DSL when state fits in-app
// orders-per-user-windowed.java (Kafka Streams DSL — illustrative)
StreamsBuilder builder = new StreamsBuilder();
KStream<String, ClickEvent> clicks = builder.stream("clicks",
Consumed.with(Serdes.String(), clickSerde));
KTable<String, UserProfile> profiles = builder.table("user-profiles",
Consumed.with(Serdes.String(), profileSerde));
KStream<String, EnrichedClick> enriched =
clicks.join(profiles, EnrichedClick::new,
Joined.with(Serdes.String(), clickSerde, profileSerde));
enriched
.groupByKey()
.windowedBy(TimeWindows.of(Duration.ofMinutes(5)).advanceBy(Duration.ofMinutes(1)))
.count(Materialized.as("clicks-per-user-5m"))
.toStream()
.to("clicks-per-user-5m-output");
Step-by-step trace.
| Stage | Operator | What it does |
|---|---|---|
| 1 | stream("clicks") |
reads click events from Kafka |
| 2 | table("user-profiles") |
reads profile changelog as a KTable (latest value per user_id) |
| 3 | clicks.join(profiles) |
stream-table join — enriches each click with the current profile |
| 4 | groupByKey().windowedBy(5m advance 1m) |
hopping window — 5-minute window every 1 minute |
| 5 | count() |
maintains a state store of rolling counts per (user_id, window) |
| 6 | to("output") |
publishes the windowed counts as a Kafka topic |
Output:
| user_id | window_start | window_end | count |
|---|---|---|---|
| u_42 | 12:00 | 12:05 | 17 |
| u_42 | 12:01 | 12:06 | 19 |
| u_99 | 12:00 | 12:05 | 4 |
Why this works — concept by concept:
- KStream + KTable duality — the changelog table is naturally modelled as a KTable; joining it with a stream gives "click events enriched with the user's profile at the time of join."
-
Hopping window —
windowedBy(5m, advanceBy=1m)gives a smoother rolling-count series than tumbling windows; useful for dashboards. - State store + changelog topic — Streams persists the rolling count locally (RocksDB) and to a compacted changelog topic in Kafka; on restart the local store is rebuilt from the changelog.
- No extra cluster — Streams runs as threads inside your application; one fewer system to operate.
- When to switch to Flink — if you need rich event-time semantics, very large state (TB-scale), or watermark-driven late-event handling, Flink is the right next step (covered in the Flink interview guide).
- Cost — local state size = O(active_window_keys × window_count); compute = O(events) per join.
Streaming
Topic — real-time analytics
Windowed analytics on streams (Kafka Streams / Flink)
Choosing the right Kafka primitive (cheat sheet)
- Need ordering across many readers? Use a partition key that matches the ordering scope (per-user, per-session, per-tenant). Global ordering = one partition (and cap on throughput).
-
Need durability against one broker loss?
replication.factor=3+min.insync.replicas=2+acks=all+unclean.leader.election=false. -
Need to safely retry producer sends?
enable.idempotence=true— turn it on and forget it. -
Need to maximise producer throughput? Increase
linger.msandbatch.size; enablecompression.type=zstd. -
Need to recover an at-least-once consumer cleanly?
enable.auto.commit=false;commitSync()after the downstream side effect succeeds. -
Need exactly-once across topics? Idempotent producer + transactions +
isolation.level=read_committed. Don't skip any of the three. - Need to land Kafka data in a warehouse? Use Kafka Connect with a sink connector before writing a custom consumer.
- Need windowed aggregations on streams? Kafka Streams if state fits in-app; Flink if state is huge or you need rich event-time semantics.
Frequently asked questions
How is Apache Kafka different from a traditional message queue like RabbitMQ?
Kafka is a distributed commit log; RabbitMQ is a queue with delete-on-consume semantics. Kafka persists messages for the entire retention window, so multiple consumer groups can read the same data independently and replay history is a first-class operation. Kafka is pull-based and uses sequential disk writes + zero-copy network transfers — it sustains millions of writes per second on commodity hardware. RabbitMQ shines for low-latency task queues and RPC patterns; Kafka shines for event streaming, CDC, and real-time analytics.
How does Kafka achieve exactly-once semantics, and what does it not cover?
Exactly-once in Kafka is the composition of three features: idempotent producer (enable.idempotence=true) deduplicates producer retries via a producer ID and sequence number; transactions (transactional.id) atomically commit writes to multiple partitions plus consumer offsets in one all-or-nothing operation; isolation.level=read_committed on consumers makes them skip in-flight transaction records. EOS guarantees end at the boundary of the output Kafka topic — for external sinks like Snowflake or BigQuery you still need an idempotent UPSERT keyed on a business id.
What is an In-Sync Replica (ISR) and why does min.insync.replicas matter?
The ISR is the set of replicas that have caught up with the leader within replica.lag.time.max.ms. Only ISR members can be elected leader, and acks=all waits for every ISR member to acknowledge the write. min.insync.replicas=2 (with RF=3) tells the broker to reject writes when the ISR shrinks to 1 — this is a deliberate trade: you'd rather see a visible write failure than have a single subsequent broker loss erase recently-committed data.
What triggers a Kafka consumer group rebalance and how does cooperative-sticky help?
A rebalance is triggered when a consumer joins or leaves the group, or when topic metadata changes (e.g. partitions added). Kafka 4.x defaults to cooperative-sticky rebalancing — only the partitions belonging to the affected consumer are reassigned; surviving consumers keep their existing partitions and don't pause. This is dramatically less disruptive than the older eager protocol, which paused every consumer in the group during every rebalance.
When should I use Kafka Connect instead of a custom consumer?
Reach for Kafka Connect first when you're moving data between Kafka and any common external system (databases via JDBC or Debezium, S3, Snowflake, BigQuery, Elasticsearch). You get offset management, retries with dead-letter topics, parallelism, schema evolution via the Schema Registry, and JMX/REST observability without writing code. Write a custom consumer only when you need business logic that doesn't fit a connector's SMT model or you need exotic transaction boundaries.
Should I use Kafka Streams or Apache Flink for stream processing?
Use Kafka Streams when your state fits comfortably in-app (low GBs), you're already on the JVM, and you want one fewer cluster to operate — it runs as threads inside your application and persists state to RocksDB + a Kafka changelog topic. Use Apache Flink when you need very large state (TB-scale), rich event-time semantics with watermarks, late-event handling, or you want stream processing decoupled from the producer applications. Flink is more powerful; Streams is simpler to deploy.
Practice on PipeCode
- Drill the streaming practice library → for end-to-end pipeline questions on partitioning, ordering, and EOS.
- Rehearse real-time analytics drills → for windowed aggregations and rolling counts.
- Sharpen Python streaming problems → when the interviewer wants
confluent-kafkacode, not pseudocode. - For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Reinforce the compute side with Apache Spark internals for DE interviews →.
- For the design-round muscles, work through ETL system design for DE interviews →.
- To pair Kafka with table modelling, browse data modelling for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every Kafka concept above ships with hands-on practice rooms where you partition real data, configure real producers, and trace real rebalances. Start with the streaming library and work outward; PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine.





Top comments (0)