DEV Community

Hitesh


Kafka interview questions for Data Engineers

Here are Kafka interview questions tailored for AI / Backend / Data Engineer (0–3 yrs) roles, with short answers you can revise quickly.
I’ve grouped them by difficulty and topic, roughly in the order interviews tend to follow.


1️⃣ Kafka Basics (Must-Know)

Q1. What is Kafka?

Answer:
Kafka is a distributed event-streaming platform used for publishing, storing, and consuming streams of records in real time. It is designed for high throughput, fault tolerance, and scalability.


Q2. Kafka vs traditional message queue?

Answer:

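Kafka                                       | Traditional message queue
Pull-based (consumers poll)                 | Typically push-based
Persistent log, kept until retention expiry | Messages usually deleted once consumed
Many consumer groups read independently     | Each message usually goes to a single consumer
Replay supported (re-read from an offset)   | Replay difficult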

Q3. What is a topic?

Answer:
A topic is a logical stream of messages where data is written and read.


Q4. What is a partition?

Answer:
A partition is a unit of parallelism in Kafka. Each partition is an ordered, immutable log.


Q5. Why partitions matter?

Answer:
They enable:

  • Parallel processing
  • Horizontal scalability
  • Ordered processing per partition

2️⃣ Producers & Consumers

Q6. What is a Kafka producer?

Answer:
A producer publishes records to Kafka topics.


Q7. How does Kafka decide which partition to write to?

Answer:

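  1. Key provided → hash(key) % number of partitions (the default partitioner uses a murmur2 hash)
  2. No key → round-robin in older clients; since Kafka 2.4 the default is sticky partitioning, which fills a batch for one partition before moving to the next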
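To make the keyed case concrete, here is a minimal sketch of hash-based partition selection. The function name choose_partition is hypothetical, and CRC32 is only a stand-in: the real Java client uses murmur2, so actual partition numbers will differ.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition index in [0, num_partitions)."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    # Assumption: CRC32 stands in for Kafka's murmur2 hash.
    return zlib.crc32(key) % num_partitions


# The same key always lands on the same partition while the count is fixed.
for key in (b"user-1", b"user-2", b"user-1"):
    print(key.decode(), "->", choose_partition(key, 6))
```

Because the result depends on the partition count, adding partitions to a topic later can move a key to a different partition.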

Q8. What is a Kafka consumer?

Answer:
A consumer reads records from topics.


Q9. What is a consumer group?

Answer:
A consumer group is a set of consumers that share the load of reading partitions.

👉 Within a group, each partition is read by exactly one consumer; one consumer can own several partitions


Q10. What happens if consumers > partitions?

Answer:
The extra consumers stay idle, because each partition is assigned to only one consumer in the group.
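A rough sketch of why this happens, using a simplified round-robin assignment. The function assign_round_robin is hypothetical and is not one of Kafka's built-in assignors; it only illustrates the one-partition-one-consumer rule.

```python
from __future__ import annotations


def assign_round_robin(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Give each partition to exactly one consumer, cycling through consumers."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    if not consumers:
        return assignment
    ordered = sorted(consumers)
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


# 3 partitions, 4 consumers: one consumer ends up with nothing to read.
print(assign_round_robin([0, 1, 2], ["c1", "c2", "c3", "c4"]))
```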


3️⃣ Offsets & Delivery Semantics

Q11. What is an offset?

Answer:
An offset is a sequential ID that marks a record's position within a partition. It is unique per partition, not across the whole topic.
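One way to picture this is a partition as an append-only list where the offset is simply the index. PartitionLog below is a hypothetical in-memory model, not how a broker stores data on disk.

```python
class PartitionLog:
    """Toy model of one partition: an append-only list of records."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Append a record and return the offset it was assigned."""
        self._records.append(value)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return (offset, value) pairs starting at the given offset."""
        chunk = self._records[offset:offset + max_records]
        return list(enumerate(chunk, start=offset))


log = PartitionLog()
for event in ("created", "paid", "shipped"):
    print(event, "stored at offset", log.append(event))
print(log.read(1))
```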


Q12. How does Kafka track offsets?

Answer:
Consumers commit their offsets per consumer group, and Kafka stores them in its internal topic: __consumer_offsets.


Q13. At-least-once vs At-most-once?

Answer:

  • At-least-once: no data loss, duplicates possible
  • At-most-once: no duplicates, data loss possible
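The difference comes down to whether the offset is committed before or after processing. The sketch below simulates a single crash and restart; consume and its parameters are hypothetical names for illustration, not a client API.

```python
def consume(records, commit_first, crash_at):
    """Process records, crashing once at index crash_at, then restarting
    from the last committed offset. Returns what actually got processed."""
    processed = []
    committed = 0      # next offset to read after a restart
    crashed = False
    pos = 0
    while pos < len(records):
        if commit_first:
            committed = pos + 1              # at-most-once: commit, then process
            if pos == crash_at and not crashed:
                crashed = True
                pos = committed              # restart skips the unprocessed record
                continue
            processed.append(records[pos])
        else:
            processed.append(records[pos])   # at-least-once: process, then commit
            if pos == crash_at and not crashed:
                crashed = True
                pos = committed              # restart re-reads the same record
                continue
            committed = pos + 1
        pos += 1
    return processed


print(consume(["a", "b", "c"], commit_first=True, crash_at=1))   # "b" is lost
print(consume(["a", "b", "c"], commit_first=False, crash_at=1))  # "b" appears twice
```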

Q14. Does Kafka support exactly-once?

Answer:
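Yes, for data that stays within Kafka (read, process, write back), using:

  • Idempotent producers
  • Transactions
  • Consumers reading with isolation.level=read_committed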
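As a rough illustration, these are the standard Kafka client settings usually associated with exactly-once. They are shown as plain dictionaries; how you pass them depends on your client library, and the transactional.id value is a made-up example.

```python
# Illustrative settings only; not tied to any particular client library.
producer_config = {
    "enable.idempotence": "true",         # broker drops duplicate retries
    "acks": "all",                        # required when idempotence is on
    "transactional.id": "orders-etl-1",   # hypothetical ID, stable per producer instance
}

consumer_config = {
    "isolation.level": "read_committed",  # hide records from aborted transactions
    "enable.auto.commit": "false",        # offsets are committed inside the transaction
}

for name, value in {**producer_config, **consumer_config}.items():
    print(f"{name}={value}")
```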

4️⃣ Fault Tolerance & Replication

Q15. What is a broker?

Answer:
A broker is a Kafka server that stores and serves data.


Q16. What is replication factor?

Answer:
Number of copies of a partition across brokers.


Q17. What is leader and follower?

Answer:

  • Leader: handles all writes and, by default, all reads for a partition
  • Follower: replicates the leader's data and can take over if the leader fails

Q18. What happens if leader fails?

Answer:
An in-sync follower (a member of the ISR) is automatically elected as the new leader by the cluster controller.
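A simplified sketch of that election, assuming the new leader is the first surviving replica that is still in sync. The function elect_leader is hypothetical, and the real controller logic handles more cases than this.

```python
def elect_leader(replicas, in_sync, failed_broker):
    """Pick the first replica (in assignment order) that is alive and in sync.

    Returns None when no in-sync replica survives, which in this model
    means the partition stays offline.
    """
    for broker in replicas:
        if broker != failed_broker and broker in in_sync:
            return broker
    return None


# Broker 1 was the leader; broker 2 had fallen behind, so broker 3 takes over.
print(elect_leader(replicas=[1, 2, 3], in_sync={1, 3}, failed_broker=1))
```

Electing a replica that is not in sync is possible only when unclean.leader.election.enable is turned on, at the risk of losing data.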


5️⃣ Performance & Reliability

Q19. Why is Kafka fast?

Answer:

  • Sequential disk writes
  • Zero-copy transfer
  • Batching
  • Page cache usage

Q20. What is ISR?

Answer:
ISR (In-Sync Replicas) is the set of replicas, including the leader, that have kept up with the leader within the allowed lag (replica.lag.time.max.ms).


Q21. What is acks in producer?

Answer:

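  • acks=0 → producer does not wait for any acknowledgment; no delivery guarantee
  • acks=1 → acknowledged once the leader has written the record; it can be lost if the leader fails before followers copy it
  • acks=all → acknowledged once all in-sync replicas have the record (usually paired with min.insync.replicas)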
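The three settings can be sketched as a small decision function. producer_sees_success is a hypothetical helper that models only the happy path and the "not enough replicas" case, under the assumption that min.insync.replicas is enforced for acks=all only.

```python
def producer_sees_success(acks, leader_wrote, isr_size, min_insync_replicas=1):
    """Return True if the producer would treat the send as successful."""
    if acks == "0":
        return True                     # fire and forget: no response awaited
    if acks == "1":
        return leader_wrote             # leader's write alone is enough
    if acks == "all":
        # Rejected when too few replicas are in sync to meet the minimum.
        return leader_wrote and isr_size >= min_insync_replicas
    raise ValueError(f"unknown acks value: {acks!r}")


# Only one replica in sync, but the topic requires two.
for setting in ("0", "1", "all"):
    print(setting, producer_sees_success(setting, True, isr_size=1, min_insync_replicas=2))
```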

6️⃣ Real-World Scenarios (Very Important)

Q22. How do you ensure message ordering?

Answer:
Use the same key so related messages go to the same partition. Kafka guarantees ordering only within a partition, not across the whole topic.


Q23. How to handle duplicate messages?

Answer:

  • Idempotent consumers
  • Deduplication using unique IDs
  • Exactly-once semantics
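A minimal sketch of deduplication by unique ID on the consumer side. DedupingProcessor is a hypothetical class that keeps seen IDs in memory; a real service would typically need a durable, bounded store for them.

```python
class DedupingProcessor:
    """Apply each message at most once, keyed by its unique ID."""

    def __init__(self):
        self.seen_ids = set()
        self.applied = []

    def handle(self, message_id, payload):
        """Return True if the message was applied, False if it was a duplicate."""
        if message_id in self.seen_ids:
            return False
        self.seen_ids.add(message_id)
        self.applied.append(payload)
        return True


processor = DedupingProcessor()
for msg_id, payload in [("m1", "debit 10"), ("m2", "credit 5"), ("m1", "debit 10")]:
    print(msg_id, "applied" if processor.handle(msg_id, payload) else "skipped")
```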

Q24. How to reprocess old data?

Answer:
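Reset the consumer group's offsets to an earlier position (for example with the kafka-consumer-groups tool's --reset-offsets option, or by seeking in the consumer) and consume again. This works only while the data is still within the retention window.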
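The idea can be sketched with a toy consumer that tracks its own position. SimpleConsumer is a hypothetical in-memory model; its seek method only mirrors the concept of seeking in a real consumer.

```python
class SimpleConsumer:
    """Toy consumer that reads a list-backed partition from its position."""

    def __init__(self, log):
        self.log = log
        self.position = 0

    def poll(self, max_records=100):
        """Return the next batch and advance the position past it."""
        batch = self.log[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def seek(self, offset):
        """Move the position so the next poll starts at the given offset."""
        if not 0 <= offset <= len(self.log):
            raise ValueError("offset out of range")
        self.position = offset


consumer = SimpleConsumer(["a", "b", "c"])
print(consumer.poll())   # reads everything
consumer.seek(1)         # rewind to reprocess from offset 1
print(consumer.poll())
```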


Q25. Kafka vs RabbitMQ?

Answer:

  • Kafka → high throughput, replay, streaming
  • RabbitMQ → low-latency delivery, flexible routing, task queues

7️⃣ Kafka + Data Engineering / AI

Q26. Kafka in ETL pipelines?

Answer:
Kafka acts as a buffer and ingestion layer between producers and downstream ETL systems.


Q27. Kafka with Spark / Flink?

Answer:
Kafka provides real-time data streams; Spark/Flink process them.


Q28. Kafka for ML pipelines?

Answer:
Used for:

  • Real-time feature ingestion
  • Streaming inference
  • Online model updates

8️⃣ Configuration & Monitoring

Q29. How do you monitor Kafka?

Answer:

  • Lag
  • Throughput
  • Broker health
  • Consumer offsets

Tools:

  • Prometheus + Grafana
  • CloudWatch (MSK)

Q30. What is consumer lag?

Answer:
The difference between the latest offset in a partition (the log end offset) and the consumer group's committed offset, measured per partition.
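As a worked example, lag per partition is just a subtraction. consumer_lag is a hypothetical helper, and treating a partition with no committed offset as starting from 0 is an assumption made for this sketch.

```python
from __future__ import annotations


def consumer_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> dict[int, int]:
    """Return lag per partition: log end offset minus committed offset."""
    # Assumption: a partition with no commit yet is treated as offset 0.
    return {p: end - committed.get(p, 0) for p, end in end_offsets.items()}


lag = consumer_lag(end_offsets={0: 100, 1: 50}, committed={0: 90, 1: 50})
print(lag, "total:", sum(lag.values()))
```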


9️⃣ Advanced (Bonus)

Q31. What is log compaction?

Answer:
With log compaction, Kafka retains at least the latest record for each key and eventually removes older records with the same key.
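A simplified sketch of what compaction leaves behind. The compact function is hypothetical; it treats a None value as a delete marker (tombstone) and drops it immediately, whereas a real broker compacts in the background and keeps tombstones for a while.

```python
def compact(log):
    """Keep only the newest record per key, preserving offsets and order.

    `log` is a list of (offset, key, value) tuples; value None is a tombstone.
    """
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)     # later records overwrite earlier ones
    survivors = sorted(latest.values())        # offsets are unique, so this orders by offset
    return [record for record in survivors if record[2] is not None]


log = [
    (0, "user-1", "alice@old.example"),
    (1, "user-2", "bob@example.com"),
    (2, "user-1", "alice@new.example"),
    (3, "user-2", None),                       # tombstone: delete user-2
]
print(compact(log))
```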


Q32. What is retention policy?

Answer:

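  • Time-based (retention.ms): log segments older than the limit are deleted
  • Size-based (retention.bytes): the oldest segments are deleted once a partition exceeds the limit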

Q33. Schema Registry?

Answer:
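A service (commonly Confluent Schema Registry) that stores and versions message schemas (Avro, Protobuf, JSON Schema) and enforces compatibility rules as they evolve.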


🔥 5 One-Line Interview Killers

Memorize these:

  • “Kafka is a distributed commit log.”
  • “Partitions give scalability; keys give ordering.”
  • “Offsets enable replayability.”
  • “Consumer groups provide horizontal scaling.”
  • “Exactly-once requires idempotent producers and transactions.”

🎯 How to Answer Like a Pro

When stuck, say:

“In production, the choice depends on throughput, ordering, and replay requirements.”
