DEV Community

Grzegorz Piechnik


What is Apache Kafka and how to perform performance tests on it (Part 1)

Event streaming is a real-time data processing and transfer technique that involves the transmission of data streams over various network protocols. It is used in system and application architectures to enable rapid data exchange between different components of a system or application.

One of the platforms for real-time event management is Apache Kafka. What does its architecture look like?

General outline of operation

The intention from the beginning was simplicity: to create a tool that is as powerful as possible and able to process millions of messages in a short time. The operating principles are therefore fairly uncomplicated and can be summarized in the following illustration.

Apache Kafka Performance 1

As the graphic shows, there are three main components:

  1. Producers - these are web applications, microservices, monitoring and analytical systems that send (publish) event data to Kafka. This could be, for example, the execution of a transfer or logging into an account.
  2. Consumers - these are applications and users that read (subscribe to) data from the cluster.
  3. Kafka Cluster - a group of servers working together to hold and process data in the Apache Kafka system.

We mentioned that a Kafka Cluster is a group of servers - what is their architecture like?

Basic Kafka Architecture

A Kafka cluster is formed by one or more Brokers. Running several Brokers provides load balancing and system reliability. In the classic deployment, ZooKeeper manages and coordinates the Brokers. It is a separate piece of software used, among other things, to keep the Brokers' configuration and to track in real time which Brokers are available. (Newer Kafka versions can also run without ZooKeeper, in KRaft mode.)

Brokers host topics. A topic is a named, logical grouping of messages backed by a log. What does this mean? When a Producer sends a message, it specifies the topic. The topic acts as a kind of tag that defines the type of message. Based on it, Consumers can subscribe to and read the relevant data arriving at the Kafka cluster.

Topics are divided into smaller parts called partitions. Messages are stored inside them. You will learn about the reason why partitions exist later.
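The relationship between a topic, its partitions, and the messages inside them can be sketched with a small in-memory model. This is only a toy illustration, not Kafka's implementation: the class names `Partition` and `Topic`, the helper `make_topic`, and the topic name `payments` are all hypothetical. The sketch also assumes that each message in a partition is identified by its position in the log (its offset), which the text above does not cover.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy append-only log: messages are only ever added at the end."""
    messages: list = field(default_factory=list)

    def append(self, message: str) -> int:
        self.messages.append(message)
        # The position of the new message in the log (its offset).
        return len(self.messages) - 1


@dataclass
class Topic:
    """A named group of partitions."""
    name: str
    partitions: list


def make_topic(name: str, num_partitions: int) -> Topic:
    # Each partition is an independent log.
    return Topic(name, [Partition() for _ in range(num_partitions)])


topic = make_topic("payments", 3)  # hypothetical topic name
first = topic.partitions[0].append("transfer-created")
second = topic.partitions[0].append("transfer-completed")
print(first, second, len(topic.partitions))
```

Each partition keeps its own ordering, so messages in different partitions of the same topic are numbered independently.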

A general representation of an Apache Kafka cluster is shown in the diagram below.

Apache Kafka Performance 2

Once again about Producers and Consumers

Now that we know the basic architecture of Apache Kafka, let's take another look (this time in a bit more detail) at how data is sent to and received from topics.

Producers

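Producers publish data to topics, and each message ends up in one of the topic's partitions. A Producer can indicate the partition explicitly, but this is not required. When no partition is indicated, the partition is chosen automatically on a "load balancer" basis: messages with a key are typically mapped to a partition by hashing the key, and messages without a key are spread across the partitions.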
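The partition choice described above can be sketched as follows. This is a simplified model under stated assumptions, not Kafka's actual partitioner: the class `ToyPartitioner` is hypothetical, `zlib.crc32` stands in for the hash function Kafka really uses, and plain round-robin stands in for whatever strategy a real client applies to messages without a key.

```python
import itertools
import zlib


class ToyPartitioner:
    """Hypothetical stand-in for a producer's partition selection."""

    def __init__(self, num_partitions: int):
        self.num_partitions = num_partitions
        self._round_robin = itertools.cycle(range(num_partitions))

    def choose(self, key=None, partition=None) -> int:
        # 1. An explicitly indicated partition always wins.
        if partition is not None:
            if not 0 <= partition < self.num_partitions:
                raise ValueError("partition out of range")
            return partition
        # 2. With a key, hash it so the same key always maps to the same partition.
        #    (crc32 is an assumption made for this sketch.)
        if key is not None:
            return zlib.crc32(key.encode("utf-8")) % self.num_partitions
        # 3. Without a key, spread messages evenly across the partitions.
        return next(self._round_robin)


partitioner = ToyPartitioner(num_partitions=3)
explicit = partitioner.choose(partition=2)
keyed_a = partitioner.choose(key="account-42")
keyed_b = partitioner.choose(key="account-42")
unkeyed = [partitioner.choose() for _ in range(4)]
```

Mapping a key to a fixed partition is what allows all events for one entity (for example, one account) to stay in the same partition.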

Consumers and Consumer Groups

Let's imagine an application running as two instances, each of which wants to read information from a topic and store it in one common database. It would be inefficient for both instances to read the same data at the same time, because we would read twice as much from the Apache Kafka server as we need. What to do in such a case?

Apache Kafka introduces the concept of a consumer group. This is nothing more than grouping consumers so that together they read the data they are interested in. Within a group, each partition is read by only one consumer, so each consumer gets roughly (total number of partitions) / (number of consumers) partitions. This means that the more consumers, the fewer partitions per consumer. If there are more consumers than partitions, the surplus consumers will not get any messages. This is illustrated in the following diagrams.

If a consumer group contains only one consumer, it gets all messages from all partitions.

Apache Kafka Performance 3

If there are two consumers instead of one, the partitions are divided between the two.

Apache Kafka Performance 4

When there are even more consumers, the partitions are again split among them.

Apache Kafka Performance 5

When there are more consumers than partitions, the surplus consumers are not matched with any partition.

Apache Kafka Performance 6
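The four cases above can be reproduced with a short sketch that deals partitions out to the consumers of a single group. It is a simplification: the function name `assign_partitions` and the consumer names are hypothetical, and real Kafka clients choose among several configurable assignment strategies, so the exact partition-to-consumer mapping may differ from this round-robin version.

```python
def assign_partitions(partitions: list, consumers: list) -> dict:
    """Deal partitions out to the consumers of one group, one at a time."""
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    for index, partition in enumerate(partitions):
        # Each partition goes to exactly one consumer in the group.
        consumer = consumers[index % len(consumers)]
        assignment[consumer].append(partition)
    return assignment


one = assign_partitions([0, 1, 2, 3], ["c1"])
two = assign_partitions([0, 1, 2, 3], ["c1", "c2"])
four = assign_partitions([0, 1, 2, 3], ["c1", "c2", "c3", "c4"])
too_many = assign_partitions([0, 1], ["c1", "c2", "c3"])
```

In the last case the third consumer ends up with an empty list, which corresponds to a consumer that stays idle because no partition is left for it.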

When there is more than one consumer group, the groups are independent of each other. Partitions (and the messages inside them) are not divided between consumer groups; they are only divided among the consumers within each group, as in the earlier cases. As a result, all messages reach both ConsumerGroup_01 and ConsumerGroup_02.

Apache Kafka Performance 7
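One way to picture why both groups see every message is that the read position is tracked per consumer group rather than per message. The sketch below models a single partition under that assumption. The class `ToyPartitionLog` is hypothetical, and it simplifies by letting every new group start reading from the beginning of the log, which in a real cluster depends on consumer configuration.

```python
class ToyPartitionLog:
    """One partition whose read position is tracked separately per group."""

    def __init__(self):
        self.messages = []
        self.next_offset = {}  # group id -> index of the next message to read

    def publish(self, message: str) -> None:
        self.messages.append(message)

    def poll(self, group_id: str) -> list:
        # Reading does not remove messages; it only advances this group's position.
        start = self.next_offset.get(group_id, 0)
        batch = self.messages[start:]
        self.next_offset[group_id] = len(self.messages)
        return batch


log = ToyPartitionLog()
log.publish("login")
log.publish("transfer")

group_1_first = log.poll("ConsumerGroup_01")
group_2_first = log.poll("ConsumerGroup_02")
group_1_second = log.poll("ConsumerGroup_01")
```

Here both groups receive the two messages, while a second poll by ConsumerGroup_01 returns nothing new, because only that group's position has moved.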

In summary - consumer groups help us read data (messages) from topics in applications that work together. Two consumers from one group will not get the same messages. If another application needs the same set of data (messages), it has to be in a separate consumer group.
