Muhammad Ibrahim Anis

Learning Kafka: Introduction

Hello there, and welcome to this series on Apache Kafka, called Learning Kafka (I know, not inventive). In this series, we (that is, you and I) will embark on an adventure to a faraway kingdom, where we’ll meet the protagonist of this series: Kafka. Kafka was born an orphan on the street of ……… okay, that’s enough.

Now, learning Kafka, though not hard, can be quite complicated, especially for those of us who are new to data engineering or systems design, or who have never worked with a distributed system. There seems to be an endless stream (read that again) of terminology to learn: streams, producers, consumers, brokers, topics, partitions, offsets, replication, Connect, clusters, serialization, deserialization, distributed, throughput, latency, and on and on. Don’t run away yet. If these words sound like something someone would include in a master’s thesis, then hopefully, by the end of this series, you’ll leave with your own master’s certificate.

For those who are familiar with other big data frameworks like Hadoop, Spark, Storm, or any other distributed framework, some of these concepts will be familiar or easy to pick up. But for those of us who are fortunate (or unfortunate) enough to learn Kafka as our first distributed or big data framework, it seems like we are learning not just Kafka but a whole new ecosystem. Which, to a point, is true, because you can’t truly understand Kafka without knowing how distributed systems work, or what Pub/Sub is.

By the end of this series, hopefully you’ll not only be familiar with these terms but also understand how they relate to Kafka.

One of the reasons learning Kafka is rather daunting is that we lack a reasonably detailed view of it. On the surface, Kafka is a system for building real-time data pipelines. Well and good. But when we try to build the promised data pipeline, things get complicated really quickly.

The loosely coupled architecture of Kafka, one of the reasons it is so successful, is also a reason it can be hard to grasp: to understand Kafka, we must first understand each of its components independently, then figure out how they relate to each other.

And that is what we will be doing in this series. We’ll take a step back and study, in depth, Kafka’s design, architecture, and components, and how they all fit and work together.

This series will be divided into six parts. In part one, we will have a look at what Kafka is, its origin, use cases, and features. Part two is an introduction to the core components of Kafka, like brokers, topics, and partitions. Part three looks at the design of Kafka. Part four will be further divided into three segments, each focusing on a single component in Kafka’s ecosystem: Producers and Consumers, Kafka Streams, and Kafka Connect, in that order. Part five will take a look at how these components interact with Kafka using APIs and client libraries. Finally, part six rounds things up with a look at other third-party and community applications that can be integrated with Kafka.

Also, there will be no hands-on or coding examples in this series, and no how-tos. The objective is to get to know Apache Kafka properly.

This series, despite best attempts, can in no way do justice to Kafka, because Kafka is deep. For further reading and more detailed, in-depth explanations, you can’t go wrong with either of these books:

  • Kafka: The Definitive Guide by Gwen Shapira, Todd Palino, Rajini Sivaram, and Krit Petty

  • Effective Kafka by Emil Koutanov

I hope you’ll enjoy consuming this series. I certainly enjoy producing it.

Coming up: an introduction to our protagonist, Apache Kafka.
