What is Kafka?
Kafka is a publish/subscribe (pub/sub) messaging system that provides data streaming capabilities while also taking advantage of distributed computing.
What is a pub/sub messaging system?
A pub/sub messaging system contains two kinds of components that relay data between each other: one component publishes data, while the other subscribes to the publisher to receive the published data.
Kafka follows this pattern with its own set of components and features.
Producers
The first component in a pub/sub messaging system is the publisher, which is referred to as a Producer in Kafka. The producer is a data source that publishes, or produces, a message into Kafka. One of the great features of Kafka is that it is data-type independent: Kafka does not care what type of data is being produced, whether it's the GPS signal of a car, application metrics from front-end servers, or even images!
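To make this concrete, here is a minimal sketch of a Java producer (not from the original article); the broker address localhost:9092 and the topic name gps-signals are made-up placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class GpsProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address; point this at your own cluster.
        props.put("bootstrap.servers", "localhost:9092");
        // Kafka stores raw bytes; these serializers turn String keys/values into bytes.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one message to the (hypothetical) "gps-signals" topic.
            producer.send(new ProducerRecord<>("gps-signals", "car-42", "52.5200,13.4050"));
        } // close() flushes any buffered messages before exiting
    }
}
```

Because the value is sent as bytes, swapping the serializer is all it takes to produce JSON, Avro, or raw image bytes instead of strings.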
Consumers
The second component in a pub/sub messaging system is the subscriber, which is referred to as a Consumer in Kafka. The consumer can subscribe, or listen, to a data stream and consume messages from that stream while having no relationship with, or knowledge of, the producers.
Consumers can subscribe to multiple streams of data regardless of the type of data being consumed. In other words, you can have a single application that takes in data from as many different sources as you’d like. Kafka makes it easy to access the data you need while leaving the processing steps entirely in your control.
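As a rough sketch (again, not from the original article), a Java consumer subscribing to two made-up topics could look like this; the broker address and group id are placeholders:

```java
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MetricsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("group.id", "metrics-app");             // hypothetical consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // One consumer can listen to several streams at once (topic names are made up).
            consumer.subscribe(Arrays.asList("gps-signals", "frontend-metrics"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // All processing logic lives here, entirely under your control.
                    System.out.printf("%s [%d] @%d: %s%n",
                            record.topic(), record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```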
High-level Architecture
Now that you know where messages come from (producers) and how messages can be retrieved (consumers), let’s discuss what happens in between.
Simple Kafka flow
Let's imagine a simple Kafka flow with three producers and two consumers. Each producer must specify a destination for its message, and each consumer must specify where it needs to consume from. This middle ground between the producer and consumer, where the Kafka message is stored, is called a Topic.
Topics, Partitions, and Offsets
A topic can be thought of as a table in a database, which producers write to and consumers read from. Each topic contains Partitions, which are essentially append-only logs that commit Kafka messages in the order they arrive. To identify messages, partitions use an auto-incrementing integer called an Offset, which is unique within a partition.
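As a small sketch of how a topic and its partitions come into being, here is the Java AdminClient creating a topic; the topic name, partition count, and broker address are illustrative choices, not requirements:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker

        try (AdminClient admin = AdminClient.create(props)) {
            // A "gps-signals" topic with 3 partitions; replication factor 1 is
            // only reasonable for a local, single-broker setup.
            NewTopic topic = new NewTopic("gps-signals", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get(); // wait for completion
        }
    }
}
```

Each of the 3 partitions then keeps its own independent sequence of offsets starting at 0.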
Offsets give consumers the flexibility to read messages when, and from where, they want; this is done by committing offsets. A commit from a consumer is like checking items off a list: once a message has been consumed, the commit tells Kafka to mark that offset as processed for that consumer.
As a consumer, you have the ability to read a partition from a specified offset or from the last committed message. How can this be useful? Consider an application that receives data every 2 hours. In this case, keeping the application continuously running and waiting for messages can be very expensive. By reading from the last committed message, you could have the application go live every 8 hours, consume all new messages in one batch, and commit the offset of the latest message, as in the sketch below. This can reduce costs and resource usage significantly.
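A minimal sketch of that batch pattern with the Java client follows; the topic, group id, and broker address are hypothetical, and error handling is omitted:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class BatchConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("group.id", "batch-app");               // hypothetical consumer group
        props.put("enable.auto.commit", "false");         // we commit manually below
        props.put("auto.offset.reset", "earliest");       // start from the beginning if no commit exists yet
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition partition = new TopicPartition("gps-signals", 0);
            consumer.assign(Collections.singleton(partition));
            // Option A: jump to an explicit offset...
            // consumer.seek(partition, 1000L);
            // Option B (the default for a group): resume from the last committed offset.

            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.println(record.value()); // application logic goes here
            }
            // Mark everything polled so far as processed for this consumer group,
            // so the next run resumes right after the last message handled here.
            consumer.commitSync();
        }
    }
}
```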
Brokers and Clusters
Now you know about producers, consumers, and how messages flow within Kafka, but one of the most important components remains: the Kafka Broker. The broker is what ties the whole system together; it is the Kafka server responsible for all communications involving producers, consumers, and even other brokers. Producers rely on the broker to correctly accept and store incoming messages in the appropriate topic. Consumers rely on the broker to handle their fetch and commit requests while consuming from topics.
A group of brokers is called a Kafka Cluster. One of the biggest perks of using Kafka is its use of distributed computing. A distributed system shares its workload among many computers, called nodes, which work together and communicate to complete the work rather than having it all assigned to a single node. When multiple Kafka brokers deal with large amounts of data, distributed computing saves resources and increases overall performance, making Kafka a desirable choice for big data applications.
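On the client side, you usually hand over several brokers from the cluster rather than one. A tiny sketch, with made-up host names:

```java
import java.util.Properties;

public class ClusterConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical addresses. Listing several brokers means the client can still
        // reach the cluster if any single broker is down: whichever broker answers
        // first returns metadata describing the whole cluster.
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
        System.out.println(props);
    }
}
```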
Benefits of Apache Kafka
There are four key benefits of using Kafka:
- Reliability: Kafka distributes, replicates, and partitions data, and it is fault-tolerant.
- Scalability: Kafka's design lets you handle enormous volumes of data, and it can scale out without any downtime.
- Durability: Messages are persisted to storage as quickly as possible after they are received, so Kafka is durable.
- Performance: Finally, Kafka maintains the same level of performance even under extreme loads of data (many terabytes of message data), handling up to two million writes per second. In short, Kafka can store large amounts of data with zero downtime and no data loss.
Disadvantages of Apache Kafka
After discussing the advantages, let’s take a look at the disadvantages:
- Limited flexibility: Kafka doesn't support rich queries. For example, you can't filter for specific data within messages on the broker side; that kind of processing is the responsibility of the consumer application reading the messages. With Kafka, you can simply retrieve messages from a particular offset, ordered as Kafka received them from the producer.
- Not designed for holding historical data: Kafka is great for streaming data, but it isn't meant to be a long-term data store; retention is typically configured in hours or days. Additionally, data is replicated, which means storage can quickly become expensive for large amounts of data. You should use Kafka as transient storage where data gets consumed as quickly as possible.
- Wildcard topic support varies by client: consuming from multiple topics with one consumer is possible, and the Java client even accepts a regular-expression subscription, so topics like log-2019-01 and log-2019-02 can both be matched with a pattern such as log-2019-.* (see the sketch after this list). Not every client library offers this, however, and a pattern can only select whole topics, never individual messages within them.
The above disadvantages are design trade-offs intended to improve Kafka's performance. For use cases that expect more flexibility, they can constrain an application consuming from Kafka.
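For completeness, a sketch of that regex subscription with the Java client; the broker address, group id, and topic naming scheme are made up for illustration:

```java
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PatternSubscriber {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("group.id", "log-reader");              // pattern subscribe requires a consumer group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Matches log-2019-01, log-2019-02, and any future topic with this prefix.
            consumer.subscribe(Pattern.compile("log-2019-.*"));
            // ... poll in a loop as usual ...
        }
    }
}
```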
Summary
Kafka is a great tool for handling and processing data, especially in big data applications. It's a reliable platform that provides low latency and high throughput through its data-streaming capabilities, and it offers a wealth of helpful features and services to make your applications better.