Introduction
Apache Kafka is an open-source, distributed event-streaming platform, often described as a distributed commit log. It was developed at LinkedIn by a team that included Jay Kreps, Jun Rao, and Neha Narkhede. Kafka is built to ingest and process data in real time, so it can be used to implement high-performance data pipelines, streaming analytics applications, and data integration services.
Apache Kafka Key Features and Concepts
Distributed System: Kafka runs as a cluster of one or more nodes that can live in different datacenters. Data and load are distributed across the nodes in the cluster, which makes Kafka inherently scalable, available, and fault-tolerant.
Event Streaming: An event is any type of action, incident, or change that's identified or recorded by software or applications, for example a payment, a website click, or a temperature reading, along with a description of what happened. Kafka excels at handling continuous streams of such events, making it ideal for real-time applications and data pipelines.
Scalability: Kafka can scale horizontally to handle increasing data volumes and user loads. Kafka clusters can be scaled up to a thousand brokers, handling trillions of messages per day and petabytes of data. Kafka's partitioned log model allows for elastic expansion and contraction of storage and processing capacities. This scalability ensures that Kafka can support a vast array of data sources and streams.
Durability: Kafka persists messages to disk and can replicate them across brokers, preventing data loss even in the event of system failures.
Kafka Streams: Kafka Streams is a client library that allows developers to build real-time streaming applications directly on top of Kafka. It enables processing data streams in real-time, filtering, joining, aggregating, and grouping data without writing complex code.
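Kafka ships with a small word-count demo that shows what a Streams application looks like in practice. Once a broker is running (see Getting Started below), a minimal run is roughly the following sketch; streams-plaintext-input is the input topic the bundled demo expects, and the class path is the one included with recent Kafka releases:

kafka/bin/kafka-topics.sh --create --topic streams-plaintext-input --bootstrap-server localhost:9092 --partitions 1 --replication-factor 1

kafka/bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo

Messages produced to streams-plaintext-input are split into words and counted, and the running counts are written to the streams-wordcount-output topic.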
Kafka Connect: Kafka Connect is a framework for connecting Kafka to external systems, allowing data to be moved into and out of Kafka.
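As a sketch, Kafka's standalone Connect mode together with the sample file connectors bundled in the download can pipe a local text file into a topic. Assuming the stock config files, the invocation is roughly the one below; on recent Kafka versions you may first need to point plugin.path in connect-standalone.properties at the bundled connect-file jar:

kafka/bin/connect-standalone.sh kafka/config/connect-standalone.properties kafka/config/connect-file-source.properties

With the default file-source config, lines appended to test.txt in the working directory are published to a topic named connect-test.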
ksqlDB: ksqlDB is a stream processing engine that extends the Kafka Streams API, allowing developers to query and analyze streams using SQL-like syntax.
Getting Started with Kafka
It is often recommended to start Apache Kafka with ZooKeeper for optimum compatibility. Also, installing Kafka on Windows may run into several problems, because Kafka is not natively designed for Windows. On Windows it is advised to use:
- WSL (Windows Subsystem for Linux)
- Docker

Otherwise, use Ubuntu to install and run Kafka. For either OS, make sure you have Java 11 or 17. Ensure Java is installed by running
java --version
If it is not, run
sudo apt install openjdk-11-jdk
The first command checks the Java version; if Java is not installed, the second command installs it.
Once that's done, head over to the Kafka downloads page and grab either the source or the binary release. I will use the 3.6.0 source download; note that the pre-built binary (e.g. kafka_2.13-3.6.0.tgz) is simpler for getting started, because a source download has to be compiled before the scripts in bin/ will work. Go to your Downloads folder, open a terminal, and run the following command to extract the archive.
tar -xzf kafka-3.6.0-src.tgz
This will extract the archive and create a new folder for us. Now, we can rename the folder by running the following command
mv kafka-3.6.0-src kafka
This renames the kafka-3.6.0-src folder to kafka, giving us a shorter path to type in the commands that follow.
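Because I picked the source download, the project must be built once before the shell scripts under bin/ are usable. A minimal build, assuming the Gradle wrapper bundled in the source tree, looks like this (the binary download skips this step entirely):

cd kafka

./gradlew jar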
Start Zookeeper
ZooKeeper handles cluster coordination and metadata for Kafka, so it must be launched before the Kafka server. ZooKeeper ships as part of the Kafka download, so there is nothing extra to install.
To start ZooKeeper, you can run
kafka/bin/zookeeper-server-start.sh kafka/config/zookeeper.properties
kafka/bin/zookeeper-server-start.sh is the script that starts the ZooKeeper server, and kafka/config/zookeeper.properties is the path to the configuration file it reads.
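For reference, the stock zookeeper.properties that ships with Kafka is tiny; the two settings worth knowing about are roughly these (values from the default file, shown here as comments):

cat kafka/config/zookeeper.properties

# dataDir=/tmp/zookeeper   (where ZooKeeper stores its snapshot data)
# clientPort=2181          (the port the Kafka broker uses to connect to ZooKeeper)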
Start Kafka Server
Open another terminal, and run the following command
kafka/bin/kafka-server-start.sh kafka/config/server.properties
The kafka/bin/kafka-server-start.sh script starts the Kafka server (broker), and kafka/config/server.properties is the path to the configuration file for Apache Kafka.
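For a local setup, the defaults in server.properties that matter most are roughly the following (values are from the stock file; yours may differ):

grep -E '^(broker.id|log.dirs|zookeeper.connect)=' kafka/config/server.properties

# broker.id=0                        (unique integer ID of this broker in the cluster)
# log.dirs=/tmp/kafka-logs           (where the broker stores partition data on disk)
# zookeeper.connect=localhost:2181   (how the broker finds ZooKeeper)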
Create a topic
Once the ZooKeeper server and the Kafka server are both running, we can create a topic. Open another terminal window and run the following command.
kafka/bin/kafka-topics.sh --create --topic testourtopic --bootstrap-server 127.0.0.1:9092 --partitions 1 --replication-factor 1
testourtopic is the topic name that will be created once the command is executed. By default, Apache Kafka listens on port 9092.
kafka/bin/kafka-topics.sh
This is the script used to manage Kafka topics. It is located inside the Kafka installation directory (kafka/bin).
--create
This flag tells Kafka to create a new topic.
--topic testourtopic
This specifies the name of the topic to create.
--bootstrap-server 127.0.0.1:9092
This defines the Kafka broker address. 127.0.0.1:9092 means Kafka is running on the local machine (localhost) on port 9092.
--partitions 1
This sets the number of partitions for the topic to 1.
--replication-factor 1
This sets the replication factor to 1. The replication factor cannot exceed the number of brokers in the cluster, so 1 is the only valid value on our single-broker setup.
To list topics you need to run the following command
kafka/bin/kafka-topics.sh --list --bootstrap-server localhost:9092
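With the topic created, you can sanity-check the whole setup using the console producer and consumer that ship with Kafka. In one terminal, start a producer and type a few messages; each line you enter becomes one event on the topic:

kafka/bin/kafka-console-producer.sh --topic testourtopic --bootstrap-server localhost:9092

Then, in a second terminal, read everything back from the start of the topic (Ctrl+C stops either tool):

kafka/bin/kafka-console-consumer.sh --topic testourtopic --from-beginning --bootstrap-server localhost:9092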
Apache Kafka Architecture
Apache Kafka's architecture revolves around a distributed, fault-tolerant system for handling real-time data streams. Its key components are producers, consumers, brokers, topics, and partitions, which together enable high-throughput, low-latency data processing.
Brokers: These are servers that manage data streams. Kafka clusters consist of one or more brokers. A broker works as a container that can hold multiple topics with different partitions. A unique integer ID is used to identify brokers in the Kafka cluster.
Topics: Topics are named channels or categories through which messages are sent and received. A stream of messages belonging to a particular category or feed name is referred to as a Kafka topic; in Kafka, data is stored in the form of topics.
Producers: Applications that write data (messages) to Kafka topics. They publish messages to one or more topics in the Kafka cluster. Whenever a producer publishes a message, the broker receives it and appends it to a particular partition; producers can also influence which partition a message goes to, typically by attaching a key to it, as in the sketch below.
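As a quick sketch with the console producer: the parse.key and key.separator properties attach a key to each message, and the producer hashes the key to pick a partition, so messages with the same key always land in the same partition (the key and value shown are just examples):

kafka/bin/kafka-console-producer.sh --topic testourtopic --bootstrap-server localhost:9092 --property parse.key=true --property key.separator=:

# then type lines such as: user42:clicked-checkout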
Consumers & Consumer Group: Consumers are applications that read data from Kafka topics. Data is pulled from the broker when the consumer is ready to receive messages. A consumer group is a set of consumers that pull data from the same topic or set of topics, with Kafka balancing the topic's partitions across the group's members, as illustrated below.
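A rough illustration with the console consumer: run the same command in two terminals, and the matching --group name makes them one consumer group, so Kafka splits the topic's partitions between them (the group name demo-group is arbitrary; with a one-partition topic, only one member will actually receive messages):

kafka/bin/kafka-console-consumer.sh --topic testourtopic --group demo-group --bootstrap-server localhost:9092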
Partitions: Topics are divided into a configurable number of partitions, which are ordered, immutable sequences of messages; partitions are what enable horizontal scalability and parallel processing.
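For example, creating a topic with three partitions and inspecting it with --describe shows one line per partition; on our single-broker setup every partition reports broker 0 as its leader (the topic name multitopic is just an example):

kafka/bin/kafka-topics.sh --create --topic multitopic --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1

kafka/bin/kafka-topics.sh --describe --topic multitopic --bootstrap-server localhost:9092

# Topic: multitopic  Partition: 0  Leader: 0  Replicas: 0  Isr: 0  (and likewise for partitions 1 and 2)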
Replication: Kafka replicates data across multiple brokers within a cluster, ensuring data durability and fault tolerance.
Leader and Follower: In a replicated partition, one broker acts as the leader, handling all writes, while other brokers (followers) replicate the data.
Offsets: Each message within a partition has a unique offset, which is a sequential number that identifies its position in the partition.
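Offsets are easy to observe with the kafka-consumer-groups.sh tool that ships with Kafka. After the demo-group consumer from the sketch above has read some messages, describing the group shows, per partition, the group's current offset, the log-end offset, and the lag between them (the numbers below are illustrative):

kafka/bin/kafka-consumer-groups.sh --describe --group demo-group --bootstrap-server localhost:9092

# GROUP       TOPIC         PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# demo-group  testourtopic  0          5               5               0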