Introduction: Navigating the Data Maze
Feel free to skip this section if you want to dive right into the technical bits.
Imagine you want to order your favourite burger from your neighbourhood spot “Akaso Burger” - you visit the website/mobile app, select your item, and tap the order button. The button is disabled, displaying a spinner to indicate processing, and then re-enables. You get a success notification on the website or the app, quickly followed by an email confirmation 2 seconds later. Shortly after, an email arrives from your bank or PayPal, notifying you of the deducted amount due to your burger purchase. Now let’s imagine it’s almost Christmas, and Akaso Burger realizes that besides you ordering daily, ten thousand other people in your neighbourhood seem to love having their burgers. The company decides to have a 24-hour window where all burgers are 70% off, but a user can only order 5 burgers within this time! You want to take advantage of this, but so do over 10,000 people around you. You repeat the process you’re familiar with - order -> wait for the button to be disabled -> get notification -> get email confirmation, but this time things are a bit slower.
Most of the internet is powered by http: a straightforward request-response protocol. This means that a client initiates a request to a server, and waits some time for a response. In this case, the client is the website or the mobile app that triggers the order request when you click the button. The server is the backend handling the request - it does some operations (e.g. confirming there are enough resources) and then returns a response to the client. This is the standard we’re used to, but what happens when a website has to handle lots and lots of requests? The website can simply continue with the request-response paradigm we know and love, but this means the wait time for a response greatly increases as more people start ordering from Akaso Burger. When the server receives an order, it:
Validates that the user hasn’t ordered more than 5 burgers in 24 hours
Checks if there is enough resources (ingredients, manpower, time) to handle the order
Charges the linked card or PayPal
Confirms the order on the app and at the same time:
Sends a confirmation email
If we go with the http request-response approach, we would need to do the steps above for each order, going through the flow linearly before arriving at the last step. As more users rush to Akaso Burger’s app, this starts getting noticeably slow because each request has to wait for its turn to be processed. As the number of requests increases, the system's capacity to handle them concurrently becomes a bottleneck. What if we forego our full synchronous approach and rethink our architecture? For example, we can notify the user that the order has been received and is being processed. A few seconds later, we notify the user that the order has been confirmed. We can decide to do step 1 above instantly and give the user immediate feedback while we handle the processing in the background. Since we have now broken out of the typical request-response cycle we know and love, how do we handle this huge stream of data now lying somewhere in the background as close to real-time as possible? This is where tools such as RabbitMQ and Apache Kafka come in.
Asynchronous Processing: The Contenders
When it comes to asynchronous processing, RabbitMQ and Kafka emerge as leading contenders. Both serve as robust message broker systems designed for near-real-time, scalable, and high-speed asynchronous processing, but they adopt different approaches. IBM defines a message broker as "a software that enables applications, systems, and services to communicate with each other and exchange information". Let's explore the distinctions before delving into the unique strengths of Kafka.
Kafka's Approach:
Durability: Messages in Kafka persist even after delivery/consumption.
Message Replay: Kafka can replay messages, allowing the resend of the same message.
Speed and Scalability: Built for high throughput, Kafka employs sequential I/O, avoiding random-access memory overhead. By not deleting messages, Kafka conserves compute cycles.
Consumer Pull Model: Unlike pushing messages to consumers, Kafka consumers pull or poll messages from the broker, offering consumer flexibility and efficient resource utilization.
RabbitMQ's Features:
Priority Queues: RabbitMQ supports priority queues, allowing some messages to be routed to higher-priority queues for expedited processing.
Acknowledgment System: Consumers in RabbitMQ acknowledge message receipt, and this information is relayed to the producer, ensuring a reliable message delivery system (at the cost of higher latency).
Ease of Learning: With a simpler architecture, RabbitMQ is considered more straightforward for users new to asynchronous processing.
In the following sections, we'll dive deeper into Kafka's unique architecture.
Components of Kafka
In the previous section, we mentioned that the messages (data) are “delivered”. What exactly does this mean? Kafka has several components that enable it to send and receive messages. Producers send these messages to the broker, which in turn sends (delivers) the messages to one or more consumers. Since the broker itself does not know what the message is, the consumer needs to know what message is relevant to itself. This is aided by using topics - the producer publishes the message to a specific topic on the broker and the consumer(s) listen for messages only for that topic.
Let’s use a simple metaphor to explain this: Imagine you are a postman delivering packages in a building filled with engineers, mathematicians, and doctors. Each group has a numbered mailbox to receive packages relevant to them (e.g. 1 for Mathematicians, 2 for Engineers, etc). You label each package with the appropriate number and drop it off with the concierge. The concierge uses the number to identify what mailbox to place the package in. Every once in a while, a mathematician, engineer, or doctor walks to their designated mailbox to check if a new package has arrived. In this analogy, you are the producer, the concierge is the broker, the number is the topic, and the people checking the mailboxes are the consumers.
Broker
A central server with the Kafka program running, a broker is the cornerstone of the Kafka system. Because Kafka was designed with scalability and a high degree of fault tolerance and availability in mind, most setups do not use just one broker, but rather a combination of brokers working together called a cluster. These brokers can be deployed in different availability zones to minimize the risk of downtime. The broker is responsible for:
Message Persistence: Stores and manages messages produced by Kafka producers.
Topic and Partition Management: Organizes messages into topics, divides topics into partitions, and manages the partition creation, replication, and reassignment. Each partition has one leader and multiple followers. The broker is responsible for leader election and ensuring that the leader is actively serving read and write requests.
Producer Communication: Acts as the endpoint for Kafka producers to send messages. Producers connect to brokers and publish messages to specific topics. The broker is responsible for receiving and acknowledging these messages.
Consumer Communication: Consumers connect to brokers to subscribe to topics and receive messages. It also maintains an offset for each consumer to keep track of the last consumed message within a partition, as well as managing and updating the offset (position) for each consumer within a partition. This allows consumers to resume reading from where they left off in case of failures or restarts.
Log Compaction: Supports log compaction for topics, retaining only the latest value for each key.
Security and Access Control: Implements security features such as authentication and authorization to control access to topics.
Monitoring and Metrics: Provides metrics for tracking the health, performance, and resource utilization of the Kafka cluster.
Dynamic Configuration: Support dynamic configuration changes so administrators to modify configurations without requiring a restart.
Producer
Producers are client code interacting with the Kafka broker, responsible for sending messages to specified topics. While multiple producers can be created, reusing a single producer generally offers better performance.
When the producer starts up, it establishes a TCP connection with the broker(s) to get metadata such as topics, partitions, leaders, and clusters. It also opens another TCP connection for message sending when the producer send()
function is called. Subsequent send()
calls to the same topic reuse the same TCP connection. The default TCP connection between the producer and broker is unencrypted plaintext, eliminating the overhead of the brokers decrypting the messages. Because plaintext is not suitable if any of the servers are public, Kafka allows the producer (and consumer) to select a connection protocol during initialization. It is recommended to use a secure option like SSL (might not be needed when both servers are in a VPC). Periodically, according to the config metadata.max.age.ms, a refresh of the metadata happens to proactively discover new brokers or partitions. For a deeper dive into producer internals, check this awesome post.
The exact timing of when a message is sent can vary depending on the configuration and the acknowledgment settings. For the most part, message sending happens asynchronously. When the producer send()
function is called, it receives an acknowledgment from the broker even if the message has not been fully committed. When batching is enabled (the default), the producer adds any message to its internal buffer and attempts to send it immediately according to the batch size (default of 16KB). If the linger config is enabled, the producer will wait for the linger milliseconds before sending the batch. This is aimed at increasing throughput but at the expense of increasing latency. For example, a linger.ms
of 5 means that a 5ms artificial delay is introduced before the producer sends the batch. This increases the chances of messages being sent in a batch since the producer would “linger” for 5ms to see if more messages arrive and add them to the batch. It should be noted though that if we already have the batch size worth of messages, this setting is ignored and the producer sends the batch immediately. In summary, producers will send out the next batch of messages whenever linger.ms
or batch.size
is met first.
We can force the messages to be sent (irrespective of the linger and batch settings) by flushing the producer. This is a blocking operation and effectively makes the producer synchronous because it has to wait for acknowledgement of delivery from the broker. As a result, flushing should be used sparingly (e.g low throughput environments or in tests). Another point worth mentioning is that the buffer memory is finite, so if more messages arrive over the buffer.memory
limit, the producer will be blocked for a configurable time max.block.ms
after which it throws an exception. The buffer memory might be full if the producer is receiving messages faster than it is sending, or if the broker is down for any reason.
Topic
A Kafka topic serves as a way to organize messages, akin to organizing files in folders. Producers choose a topic to write to, while consumers select the topic they wish to read from. Topics are created and managed using CLI commands specific to the operating system: kafka-topics.bat
for Windows and kafka-topics.sh
for Mac and Linux (usually in a startup script).
Topics, including those created dynamically, are stored and replicated on Kafka brokers. Kafka automatically creates certain topics, such as the __consumer_offsets topic, as part of its operation. The ability to dynamically create topics is governed by the auto.create.topics.enable
setting on the broker. When enabled, the broker creates a topic and partitions when:
Producer writes to a topic that isn’t currently created.
Producer fetches metadata for a topic that does not exist
Consumer reads from a topic that does not exist
However, creating topics on the fly can lead to maintenance challenges as a typo can cause unwanted topics to be created. Topics created this way also share the same replication factor, number of partitions, and retention settings.
Kafka uses topics to parallelize processing and scale horizontally by splitting messages into different partitions in different brokers. We will discuss partitions in more detail later on.
Consumer
A Kafka consumer is a client application that reads data from a specific topic, more precisely from a particular partition within that topic. Similar to producers, consumers establish a TCP connection with the broker when started. Unlike traditional systems where messages are pushed to consumers, Kafka's design encourages consumers to poll the broker for new messages at their own pace. The reasons for this pull-based approach are detailed in Kafka's consumer design documentation.
Consumers continuously poll the broker for new messages, returning a number of messages defined by max.poll.records
alongside the offsets for these messages. Kafka uses offset to keep track of the position of the message the consumer has read. The default behaviour of the consumer is to auto commit offsets enable.auto.commit
, which means that every 5 seconds (or auto.commit.interval.ms
), the consumer updates the broker with the current offset. This is handled by the client libraries making a request to the broker to update the internal __consumer_offsets
topic. Consider disabling auto-commit and using manual commit in production to avoid potential issues with offsets. This is well captured in the “Auto Commit” section of this article which is pasted here for visibility:
With auto commit enabled, kafka consumer client will always commit the last offset returned by the poll method even if they were not processed. For example, if poll returned messages with offsets 0 to 1000, and the consumer could process only up to 500 of them and crashed after auto commit interval. Next time when it resumes, it will see last commit offset as 1000, and will start from 1001. This way it ended up losing message offsets from 501 till 1000. Hence with auto commit, it is critical to make sure we process all offsets returned by the last poll method before calling it again. Sometimes auto commit could also lead to duplicate processing of messages in case consumer crashes before the next auto commit interval.
Consumers periodically send heartbeats to the broker to signal their activity and ensure the broker is aware of their current status. This occurs during message polling or when committing offsets, whether automatically or manually.
Imagine a situation where a topic is receiving 20,000+ messages in a topic per second. If we have a single consumer reading from that topic, message consumption would be considerably slow. Kafka was built for rapid real time message handling, so how does this happen? With the help of partitions and consumer groups which will be discussed in the next sections.
Beyond the Basics: Digging Deeper
Partitions and Replication
In previous sections, we established that producers write data to a topic by selecting a partition. It is important to think about Kafka’s mental model at the partition point level and not at the topic level. What exactly are partitions and why is this important for Kafka.
Let’s imagine you are participating in a Burger eating competition where you have to eat 1000 burgers. It would take you days to consume all 1000 burgers, but what if you could have your brother help eat some burgers? In this case, you could place 500 burgers in Box A and give your brother 500 burgers in Box B. Now this means you both can finish the combined 1k burgers in shorter time. What if you include your sister and nephew as well? Now you can have 4 boxes each containing 250 burgers that would be finished much faster because all 4 of you are eating the burgers at the same time, in parallel. Parallelism helps you rapidly speed up your consumption. What you have essentially done is to partition your burgers into 4 boxes so you can have 4 consumers finish these burgers much quicker. Let’s take the analogy up a notch - imagine if it is crucial for these 1k burgers to be consumed, and all 4 of you are in the same room eating the burgers. If an emergency happens in the room e.g fire alarms goes off, all four people have to stop eating the burgers. What if the first two people take 250 burgers each in one building, and the other 2 people take 250 burgers into the adjacent building? Now if any incident happens in one building, at least two people will still be eating burgers and the other two can join them later. What you have done now is called redundancy (or fault tolerance) - a fault in one building does not stop operations because the other building is available to continue.
This is exactly how Kafka handles partitions!
Partitions act as logical divisions that aid in load balancing, parallel processing, and fault tolerance.
When a topic is created, the number of partitions can...
Read the full article, including a breakdown of how partitions, replications work and consumer groups work, on: https://ladmerc.com/stream-symphony-more-than-enough-kafka
Top comments (1)
Nice introduction. Why the
#python
tag though? I mean, you can surely integrate and use Kafka from your Python-based project, but itself it is written in Java/Scala.