Sumit Negi

Apache Kafka: A brief beginner's guide part 1

Kafka: An Open-Source Distributed Event Streaming Platform

Kafka is an open-source distributed event streaming platform that enables publishing, subscribing, storing, and processing streams of records in real-time. It serves as a highly scalable, fault-tolerant, and durable messaging system, commonly used for building real-time data pipelines, streaming analytics, and event-driven architectures.

Core Concepts of Kafka

At its core, Kafka is a distributed publish-subscribe messaging system. Data is written to Kafka topics by producers and consumed from those topics by consumers. Kafka topics can be partitioned, enabling parallel data processing, and can be replicated across multiple brokers for fault tolerance.


Understanding Events and Event Streaming in Kafka

Apache Kafka serves as a distributed event streaming platform used for building real-time data pipelines and streaming applications. Events and event streaming are fundamental concepts in Kafka:

Events

In Kafka, an event represents a piece of data, often in the form of a message, produced by a publisher (producer) and consumed by one or more subscribers (consumers). Events are typically small units of data, such as log entries or transactions. They are immutable, meaning once produced, they cannot be modified. Instead, new events can be produced to reflect updates or changes.
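
In kafkajs terms, an event is simply a message object handed to the producer. A minimal sketch (the field values here are made up for illustration):

// One event as kafkajs represents it: only `value` is required;
// `key` drives partitioning and `headers` carry optional metadata.
const event = {
  key: 'user-42',
  value: JSON.stringify({ action: 'login', at: Date.now() }),
  headers: { source: 'auth-service' },
}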

Event Streaming

Event streaming involves continuously capturing, processing, and distributing events in real-time. Kafka provides a distributed, fault-tolerant, and scalable infrastructure for event streaming. It allows applications to publish events to topics and consume events from topics asynchronously, facilitating various use cases such as real-time analytics, log aggregation, and messaging systems.

Key Concepts Related to Events and Event Streaming in Kafka

Brokers

Brokers are the servers that form Kafka's storage layer, storing event streams from one or more sources. A Kafka cluster typically consists of several brokers. Any broker can act as a bootstrap server: a client that connects to one broker discovers and can reach every other broker in the cluster.
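
Because any broker can bootstrap the connection, clients are normally configured with a list of brokers. A minimal kafkajs sketch (the host names are hypothetical):

import { Kafka } from "kafkajs"

// Listing several brokers makes bootstrapping resilient: the client contacts
// the first reachable one and discovers the rest of the cluster from it.
const kafka = new Kafka({
  clientId: 'my-app',
  brokers: ['broker1:9092', 'broker2:9092', 'broker3:9092'], // hypothetical hosts
})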

Producers

Producers are applications that generate events and publish them to Kafka topics. They choose the topic each event is sent to and, through the message key (or an explicit partition number), which partition within that topic it lands in.
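
Because events with the same key are routed to the same partition, per-key ordering is preserved. A sketch, assuming the producer set up in producer.js later in this post:

// Events sharing a key land in the same partition, so everything for
// 'user-42' is stored, and later consumed, in the order it was sent.
await producer.send({
  topic: 'test-topic',
  messages: [
    { key: 'user-42', value: 'logged in' },
    { key: 'user-42', value: 'viewed dashboard' },
    { key: 'user-7', value: 'logged in' }, // may go to a different partition
  ],
})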

Consumers

Consumers retrieve events from Kafka topics. Each consumer maintains its position within a topic using metadata called the offset. Consumers can read events sequentially or move to specific offsets to reprocess past data.
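
With kafkajs, moving to a specific offset is done with consumer.seek, which must be called after run(). A sketch using the kafkaClient.js defined later in this post (the group ID here is just for the demo):

import { kafka } from "./kafkaClient.js"

const consumer = kafka.consumer({ groupId: 'rewind-demo' })
await consumer.connect()
await consumer.subscribe({ topics: ['test-topic'], fromBeginning: false })

await consumer.run({
  eachMessage: async ({ topic, partition, message }) => {
    console.log(`${topic}[${partition}] @ ${message.offset}: ${message.value.toString()}`)
  },
})

// Rewind partition 0 to the very first offset to reprocess past events.
// In kafkajs, seek() must be called after run(); offsets are strings.
consumer.seek({ topic: 'test-topic', partition: 0, offset: '0' })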

Topics

Topics are categories or channels to which events are published. Each topic consists of partitions, allowing parallel processing and scalability. Events within topics are immutable and organized for efficient storage and retrieval.
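
Topics can be created explicitly with the kafkajs admin client, choosing the partition count up front. A sketch using the kafkaClient.js defined later in this post:

import { kafka } from "./kafkaClient.js"

const admin = kafka.admin()
await admin.connect()

// 3 partitions allow up to 3 consumers in a group to read in parallel.
// replicationFactor cannot exceed the broker count (1 in the single-broker
// setup used later in this post).
await admin.createTopics({
  topics: [{ topic: 'test-topic', numPartitions: 3, replicationFactor: 1 }],
})

await admin.disconnect()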

Partitions

Partitions are segments of a topic in which events are stored as an ordered, append-only sequence. Each partition can be replicated across multiple Kafka brokers for fault tolerance.

Offsets

Offsets are sequentially increasing identifiers assigned to each event within a partition. They track the position of consumers within a partition.
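
Offsets can be inspected with the admin client: for example, the earliest and latest offsets per partition, and the positions a consumer group has committed. A sketch for kafkajs 2.x, using the kafkaClient.js defined later in this post:

import { kafka } from "./kafkaClient.js"

const admin = kafka.admin()
await admin.connect()

// Earliest ("low") and latest ("high") offset of each partition.
console.log(await admin.fetchTopicOffsets('test-topic'))

// Offsets committed so far by the consumer group used later in this post.
console.log(await admin.fetchOffsets({ groupId: 'test-group', topics: ['test-topic'] }))

await admin.disconnect()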

Consumer Groups

A consumer group is a set of consumers that collectively consume events from one or more topics. The group maintains an offset for each partition it consumes, and Kafka assigns each partition to at most one consumer within the group, enabling parallel event processing.
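
So starting several consumers with the same groupId spreads a topic's partitions among them. A runnable sketch (both consumers live in one process here only for brevity):

import { kafka } from "./kafkaClient.js"

// Two consumers in the same group: Kafka splits the topic's partitions
// between them, so each event is handled by exactly one of the two.
for (const name of ['worker-1', 'worker-2']) {
  const consumer = kafka.consumer({ groupId: 'test-group' })
  await consumer.connect()
  await consumer.subscribe({ topics: ['test-topic'], fromBeginning: true })
  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      console.log(`${name} <- partition ${partition}: ${message.value.toString()}`)
    },
  })
}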

Replication

Replication ensures data redundancy and fault tolerance. Kafka replicates topics across brokers, ensuring multiple copies of data for resilience.
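
The admin client can show how each partition is replicated; in the single-broker setup below, every partition has exactly one replica. A sketch:

import { kafka } from "./kafkaClient.js"

const admin = kafka.admin()
await admin.connect()

// For each partition: the broker acting as leader and the replica set.
const { topics } = await admin.fetchTopicMetadata({ topics: ['test-topic'] })
for (const { partitionId, leader, replicas } of topics[0].partitions) {
  console.log(`partition ${partitionId}: leader=${leader}, replicas=[${replicas}]`)
}

await admin.disconnect()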

Installation Steps

  1. Install Docker and Docker Compose.
  2. Clone the kafka-stack-docker-compose repository: git clone https://github.com/conduktor/kafka-stack-docker-compose.git
  3. cd kafka-stack-docker-compose
  4. Start single-node ZooKeeper and Kafka: docker-compose -f zk-single-kafka-single.yml up -d
  5. Check that both services are running: docker-compose -f zk-single-kafka-single.yml ps
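
Once both containers are up, you can sanity-check the broker from inside the Kafka container (in that compose file the service is named kafka1, and the Confluent images put the Kafka CLI tools on the PATH):

docker exec -it kafka1 kafka-topics --bootstrap-server localhost:9092 --list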

Node.js Coding
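
These snippets use ES modules and top-level await, so they assume Node.js 14.8 or newer and "type": "module" in your package.json. Install the client library first:

npm install kafkajs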

Create a kafkaClient.js file:

import { Kafka } from "kafkajs"

// Shared Kafka client: clientId identifies this app in broker logs/metrics,
// brokers lists the bootstrap server(s) used to discover the cluster.
export const kafka = new Kafka({
  clientId: 'my-app',
  brokers: ['localhost:9092'],
})

This code initializes a Kafka client using the kafkajs library, specifying a client ID and broker address for connecting to a Kafka cluster.

producer.js:

import { kafka } from "./kafkaClient.js"
import { Partitioners } from "kafkajs"

// LegacyPartitioner keeps the pre-2.x kafkajs partitioning behaviour and
// silences the default-partitioner warning that kafkajs 2.x prints otherwise.
const producer = kafka.producer({
  createPartitioner: Partitioners.LegacyPartitioner
})

await producer.connect()

// Publish a timestamped message to 'test-topic' every 3 seconds.
setInterval(async () => {
  await producer.send({
    topic: 'test-topic',
    messages: [
      { value: `Hello, what's up ${new Date()}` },
    ],
  })
}, 3000)

This code imports the shared Kafka client, creates a producer with the legacy partitioner, connects to the cluster, and then sends a timestamped message to the 'test-topic' topic every 3 seconds. If the topic does not exist yet, the broker will usually auto-create it (auto.create.topics.enable defaults to true); otherwise, create it first with the admin client shown earlier.

consumer.js:

import { kafka } from "./kafkaClient.js"

// Consumers that share a groupId split the topic's partitions between them.
const consumer = kafka.consumer({ groupId: 'test-group' })

await consumer.connect()
// kafkajs 2.x expects a `topics` array; fromBeginning makes a brand-new
// group start at the earliest offset instead of only receiving new events.
await consumer.subscribe({ topics: ['test-topic'], fromBeginning: true })

// Invoke the handler once per received message and log its payload.
await consumer.run({
  eachMessage: async ({ topic, partition, message }) => {
    console.log({
      value: message.value.toString(),
    })
  },
})

This code imports the shared Kafka client, creates a consumer in the 'test-group' consumer group, connects to Kafka, subscribes to the 'test-topic' topic from the beginning, and then starts the message loop, logging each incoming message's value to the console.
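
To see the pipeline end to end, start the consumer in one terminal and the producer in another; the consumer should print a new message roughly every 3 seconds:

node consumer.js
node producer.js   # in a second terminal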

Thank you for reading. Have a wonderful day!
