
Viviane Urbano

Scratching the surface of Apache Kafka

🎯 Basic concepts about Apache Kafka

Apache Kafka is an open-source system used to build real-time streaming data pipelines and applications that adapt to data streams. It combines messaging, storage, and stream processing, allowing both historical and real-time data to be stored and analyzed. It was built to solve one big problem: integrating data from different providers.

🪢 System integration between several different providers

Suppose the systems you are using come from the same company. In that case, there will be little complexity in data integration, or the challenges will require less effort, since the provider probably built systems that are not very different from one another, so you can choose the most convenient way to do this data exchange.

But if your system relies on several different providers, you will probably have an integration problem, because each company builds its systems around its own products, not to match and easily integrate with others.

Imagine that you are sending data to, and in return receiving data from, an external client. At this stage, integrations between several systems are already starting, each one with its own format specifications (e.g., XML, JSON).

After some time, new systems need to be integrated, so new integrations have to be written so that communication between the systems keeps working.

Given these details, imagine that there are 3 systems that have to communicate with 3 other systems. In this design alone, there would be 9 integrations, and each one would have to be written separately, since each integration has its own specifications, such as protocol and message format. A lot to deal with, right?

It would also be necessary to evaluate whether the target system supports all the connections that will be established. The system that is sending data has to worry about opening several connections, which can generate errors and increase CPU and network load.

🎯 Why adopt Apache Kafka?

Apache Kafka centralizes the exchange of messages between systems. It is no longer necessary for one system to connect to several other systems: communication relies only on Kafka.

😎 Read just what you need

Apache Kafka allows you to publish and subscribe to streams of records, much like a queue. These streams are stored durably and are fault-tolerant. To understand this a little better, imagine that you post a message. Someone can consume this message without it being deleted at that point. Depending on Kafka's configuration, it is possible to define how many days the message will be stored, and therefore how far back in time you can go to read (or reread) the message content.
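
To make this concrete, here is a minimal sketch of publishing a message, assuming the kafka-python library, a broker running at localhost:9092, and an example topic called orders (all of these are assumptions, not part of the article's setup):

```python
from kafka import KafkaProducer

# Connect to a Kafka broker (assumed here to run at localhost:9092).
producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Publish a message to the "orders" topic. Reading it later does not delete it;
# the message stays on the topic for the configured retention period.
producer.send("orders", value=b'{"order_id": 1, "status": "created"}')
producer.flush()  # make sure the message has really been sent before exiting
```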

0️⃣ Read data since the beginning

Another very useful feature is that if a new application is created, it will be able to read the same data from the beginning of the queue, without losing messages.
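
For example, under the same assumptions (kafka-python, a local broker, the example topic orders), a new application can read everything from the start of the topic by connecting with a consumer group Kafka has never seen and auto_offset_reset="earliest":

```python
from kafka import KafkaConsumer

# A brand-new consumer group has no stored offset yet, so with
# auto_offset_reset="earliest" it starts reading from the beginning of the topic.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="new-reporting-app",  # hypothetical name for the new application
    auto_offset_reset="earliest",
)

for message in consumer:
    print(message.offset, message.value)
```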

🔖 Get your data as soon as it's created in Kafka

One more important feature is that Kafka processes streams as they occur. This means that Kafka will notify all interested parties as soon as a new message arrives: everyone who subscribed to the topic receives the message almost immediately.

Kafka is especially useful in systems where communication must take place in real time, and for building systems that process or react to a stream of data.
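
As a sketch of this, each interested application can subscribe with its own consumer group and react to every new message as it arrives (again assuming kafka-python, a local broker, and the example topic orders):

```python
from kafka import KafkaConsumer

# Each subscribed application uses its own group_id, so every application
# receives its own copy of each new message shortly after it is produced.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="billing-service",  # another app could subscribe as "shipping-service"
)

for message in consumer:
    # React to the event in near real time.
    print("billing-service received:", message.value)
```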

🗃 Data persistence

Since Kafka can provide data to multiple systems simultaneously, it is important to keep this well organized. Can you imagine losing track of which systems have already received a message and which have not?

So, for each communication need (each integration between systems), a new topic is created. We will explain what topics are in more detail below.

📚 Topic, partition and offset

Topic

The topic is the main unit of Apache Kafka. Each topic works like a data queue. For example, think of a topic as a table in a database where records are stored: the records have the same structure and are stored sequentially.

You can create as many topics as you need. There is no limit.

Each topic must have a name that will be used to identify what will be sent/received in that topic.

Partition
Internally, the topic is divided into partitions.

By default, when a topic is created, only one partition is created; if you want more, you need to state the number of partitions explicitly. It is also possible to increase this number later.

Topics can have different numbers of partitions, and having only one partition is not recommended.
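
As an illustration, here is a minimal sketch of creating a topic with an explicit number of partitions using kafka-python's admin client (the broker address, topic name, and partition count are just example values):

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Create the "orders" topic with 3 partitions instead of the default single one.
admin.create_topics([
    NewTopic(name="orders", num_partitions=3, replication_factor=1)
])
```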

Messages are appended in order within each partition.

When a message is sent, if the topic has more than one partition, Kafka will use all of them: each message goes to a different partition in turn, unless a key is set for the message. Kafka puts all messages that have the same key in the same partition.
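
A small sketch of this behavior (same assumptions as the earlier snippets): without a key the producer spreads messages across partitions, while messages with the same key always land in the same partition:

```python
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# No key: messages are distributed across the topic's partitions.
producer.send("orders", value=b"anonymous event")

# Same key: all of these messages go to the same partition,
# so the events of customer 42 keep their relative order.
producer.send("orders", key=b"customer-42", value=b"order created")
producer.send("orders", key=b"customer-42", value=b"order paid")
producer.flush()
```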

Offset
Within each partition, every message receives an incremental identifier called the offset.

The offset works as a marker, indicating where a consumer stopped reading data in that partition of that topic.

Offset tracking happens as consumers connect to Kafka and consume messages; Kafka can identify which message was last delivered to each consumer.

Another aspect that can be configured is where consumption starts. When a consumer connects to Kafka, the offset indicates from which point data should be consumed: from the beginning of the partition or from the last message read.

Offset control is independent per consumer (more precisely, per consumer group). So if another consumer later connects to the same partition, it will consume the messages independently, according to its own offset.

In general, messages are consumed linearly, but thanks to the offset, a consumer can go back, jump to the current position, or start consuming from a specific point. This can be very useful in the event of a failure, or if you discover that messages that were consumed were processed incorrectly.
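
For illustration, a consumer can be rewound to a specific offset to reprocess messages. This is a minimal sketch with kafka-python; the topic, partition number, and offset are just example values:

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")

# Manually assign partition 0 of the "orders" topic (instead of subscribing).
partition = TopicPartition("orders", 0)
consumer.assign([partition])

# Go back to offset 100, e.g. to reprocess messages after fixing a bug.
consumer.seek(partition, 100)

for message in consumer:
    print(message.offset, message.value)
```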

To wrap up: this is just the surface of Apache Kafka. It is an amazing tool for dealing with data from different providers. Take your time studying and practicing with it. If you would like to learn more, spend some time on the Apache Kafka site.

If you would rather work with a UI tool, try Conduktor.

Last but not least, search for tutorials from Confluent about Apache Kafka.
