Grzegorz Piechnik

Posted on Dec 12, 2023

What is Apache Kafka and how to perform performance tests on it (Part 1)

#performance #devops #testing #tutorial

Event streaming is a real-time data processing and transfer technique that involves the transmission of data streams over various network protocols. It is used in system and application architectures to enable rapid data exchange between different components of a system or application.

One of the platforms for real-time event management is Apache Kafka. What does its architecture look like?

General outline of operation

The intention from the beginning was simplicity: to create a tool that is as powerful as possible and able to process millions of messages in a short time. The operating principles are therefore rather uncomplicated and can be summarized in the following illustration.

As the graphic shows, there are three main concepts:

Producers - these are web applications, microservices, monitoring and analytical systems that send (publish) event data to Kafka. An event could be, for example, the execution of a transfer or a login to an account.
Consumers - these are applications and users that read (subscribe to) data from the cluster.
Kafka Cluster - a group of servers working together to store and process data in the Apache Kafka system.

We mentioned that a Kafka Cluster is a group of servers - what is their architecture like?

Basic Kafka Architecture

A Kafka cluster is formed by one or more Brokers. Running several Brokers provides load balancing and system reliability. ZooKeeper is used to manage and coordinate the Brokers. It is software used, among other things, to configure the Brokers and monitor them in real time. (Newer Kafka versions can also run without ZooKeeper, using the built-in KRaft mode.)

Inside the Brokers are topics. These are virtual groups of messages, stored as logs. What does this mean? When a Producer sends a message, it specifies the message's topic. The topic is a kind of tag that defines the type of message. Based on it, Consumers can subscribe to and read the relevant data arriving at the Kafka cluster.

Topics are divided into smaller parts called partitions, and messages are stored inside them. The reason partitions exist is explained later.
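To make the topic and partition structure more concrete, here is a minimal in-memory sketch in Python. It is not Kafka code: the class `InMemoryBroker`, its methods, and the topic name `transfers` are hypothetical names chosen for illustration. It models each partition as an append-only list and uses the position of a message within its partition (which Kafka calls the offset) to identify it.

```python
class InMemoryBroker:
    """Toy stand-in for a broker (hypothetical, not a Kafka API).

    Maps a topic name to a list of partitions; each partition is an
    append-only log of messages.
    """

    def __init__(self):
        self.topics = {}

    def create_topic(self, name, num_partitions):
        # Each partition is an independent, ordered list of messages.
        self.topics[name] = [[] for _ in range(num_partitions)]

    def append(self, topic, partition, message):
        log = self.topics[topic][partition]
        log.append(message)
        # The offset is the message's position within its partition.
        return len(log) - 1

    def read(self, topic, partition, offset):
        # Reading does not remove messages from the log.
        return self.topics[topic][partition][offset:]


broker = InMemoryBroker()
broker.create_topic("transfers", num_partitions=3)
first = broker.append("transfers", 0, "transfer #1")
second = broker.append("transfers", 0, "transfer #2")
other = broker.append("transfers", 2, "transfer #3")
```

Note that offsets are counted per partition: `second` is 1 because it is the second message in partition 0, while `other` is 0 because it is the first message in partition 2.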

A general representation of an Apache Kafka cluster is shown in the diagram below.

Once again about Producers and Consumers

Now that we know the basic architecture of Apache Kafka, let's take another look (this time in a bit more detail) at how data is sent to and received from topics.

Producers

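Producers publish data to topics, and every message ends up in one of the topic's partitions. A Producer can indicate the partition explicitly, but this is not required. When no partition is indicated, the Producer's partitioner picks one by itself, acting like a "load balancer": messages with the same key go to the same partition, and messages without a key are spread across the partitions.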
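The partition choice described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Kafka's actual partitioner: the class `ToyPartitioner` is hypothetical, `zlib.crc32` stands in for the hash function Kafka really uses, and keyless messages are distributed with plain round-robin.

```python
import zlib
from itertools import count


class ToyPartitioner:
    """Hypothetical partition chooser for a topic with a fixed partition count."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = count()  # advances only for keyless messages

    def choose(self, key=None, partition=None):
        if partition is not None:
            # The producer indicated the partition explicitly.
            return partition
        if key is not None:
            # Same key -> same hash -> same partition (crc32 is an assumption).
            return zlib.crc32(key.encode("utf-8")) % self.num_partitions
        # No key and no partition: spread messages evenly (round-robin).
        return next(self._counter) % self.num_partitions


partitioner = ToyPartitioner(num_partitions=3)
explicit = partitioner.choose(partition=2)
same_key = {partitioner.choose(key="account-42") for _ in range(5)}
keyless = [partitioner.choose() for _ in range(4)]
```

Here `explicit` is simply the requested partition, `same_key` contains a single partition number because the key never changes, and `keyless` cycles through the partitions in order.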

Consumers and Consumer Groups

Let's imagine we have an application running as two instances, each of which wants to read information from a topic and store it in one common database. It would be inefficient for both instances to read the same data at the same time, because we would read twice as much from the Apache Kafka server as we need. What to do in such a case?

Apache Kafka introduces the concept of a consumer group. This is nothing more than a set of consumers that jointly read the data they are interested in. Each consumer in a group handles roughly (total number of partitions) / (number of consumers) partitions, and each partition is read by only one consumer of the group. This means that the more consumers there are, the fewer partitions each of them handles. If there are more consumers than partitions, the surplus consumers will not get any messages. This is illustrated in the following diagrams.

If a consumer group contains only one consumer, that consumer gets all messages from all partitions.

If there are two consumers instead of one, the partitions are divided between the two.

When there are even more consumers, the partitions are shared among all of them.

When there are more consumers than partitions, the surplus consumers are not matched with any partition.
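The four cases above can be reproduced with a small Python sketch. It assumes a topic with 4 partitions (the number is an assumption for illustration) and a simple round-robin assignment; the function `assign_partitions` and the consumer names are hypothetical, and Kafka's real assignment strategies may distribute partitions differently.

```python
def assign_partitions(num_partitions, consumers):
    """Assign each partition to exactly one consumer of a group (round-robin)."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# One consumer: it reads every partition.
one = assign_partitions(4, ["c1"])
# Two consumers: the partitions are split in half.
two = assign_partitions(4, ["c1", "c2"])
# As many consumers as partitions: one partition each.
four = assign_partitions(4, ["c1", "c2", "c3", "c4"])
# More consumers than partitions: the surplus consumer stays idle.
five = assign_partitions(4, ["c1", "c2", "c3", "c4", "c5"])
```

In the last case `five["c5"]` is an empty list, which corresponds to the consumer that is not matched with any partition.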

When there is more than one consumer group, the groups are independent of each other. Partitions (and the messages inside them) are not split between groups; they are split only among the consumers within each group, as in the earlier cases. As a result, all messages reach both ConsumerGroup_01 and ConsumerGroup_02.
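The independence of consumer groups can be sketched as follows. This is a simplified model, not Kafka code: the class `ToyTopic` is hypothetical, the topic has a single partition for brevity, and each group is assumed to remember its own read position in the log.

```python
class ToyTopic:
    """Hypothetical single-partition topic with one read position per group."""

    def __init__(self):
        self.log = []      # messages in arrival order
        self.offsets = {}  # group id -> index of the next message to read

    def publish(self, message):
        self.log.append(message)

    def poll(self, group_id):
        # Each group starts from its own position, so groups do not
        # take messages away from one another.
        start = self.offsets.get(group_id, 0)
        batch = self.log[start:]
        self.offsets[group_id] = len(self.log)
        return batch


topic = ToyTopic()
topic.publish("login: alice")
topic.publish("transfer: 100")

group_01 = topic.poll("ConsumerGroup_01")
group_02 = topic.poll("ConsumerGroup_02")
group_01_again = topic.poll("ConsumerGroup_01")
```

Both groups receive the full set of messages, while a second poll by ConsumerGroup_01 returns nothing new because that group has already read everything.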

In summary, consumer groups help us read data (messages) from topics in applications that work together. Two consumers from the same group will not get the same messages. If another application needs the same set of data (messages), it has to be in a separate consumer group.
