Apache Kafka 101


  1. Topics
  2. Partitions
  3. Brokers
  4. Replication
  5. Producers
  6. Consumers
  7. Schema Registry
  8. Kafka Connect
  9. Stream Processing

1. Topics

  • Kafka uses logs to record every event in sequence.
  • This means that instead of replacing old values, each new event is simply appended to the log, preserving the entire history.

Each event in a topic is called a message (event), and it consists of three main parts:

  1. Key: An identifier, used to determine which partition the message is routed to.
  2. Value: The actual data (JSON, etc.).
  3. Timestamp: When the event was produced.

Messages in a topic are immutable, so if you want to transform or filter data, you don't modify the original topic; instead, you create a new topic that reflects those changes.

  • Topics are not queues (where a message disappears once it is read).
  • Instead, they offer replayability: messages remain available for multiple consumers to read.
  • Log compaction: if you only care about the latest state of each key, enable compaction to remove older versions.
  • Retention policies control how long messages are kept (default: 7 days).
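
As a hedged illustration of these settings, here is a minimal sketch that creates one compacted topic and one topic with shorter retention, using the AdminClient from confluent-kafka-go (the same client library used later in this post). The topic names and values are made up for the example.

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
    a, err := kafka.NewAdminClient(&kafka.ConfigMap{"bootstrap.servers": "localhost"})
    if err != nil {
        panic(err)
    }
    defer a.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // One compacted topic (keeps only the latest value per key) and
    // one topic with a 3-day retention instead of the 7-day default.
    results, err := a.CreateTopics(ctx, []kafka.TopicSpecification{
        {
            Topic:             "user-profiles", // hypothetical topic name
            NumPartitions:     3,
            ReplicationFactor: 1,
            Config:            map[string]string{"cleanup.policy": "compact"},
        },
        {
            Topic:             "page-views", // hypothetical topic name
            NumPartitions:     3,
            ReplicationFactor: 1,
            Config:            map[string]string{"retention.ms": "259200000"}, // 3 days
        },
    })
    if err != nil {
        panic(err)
    }
    for _, r := range results {
        fmt.Printf("%s: %v\n", r.Topic, r.Error)
    }
}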

2. Partitions

Each topic is split into partitions, and each partition can be stored on a different node, allowing Kafka to handle far larger amounts of data.

  • If a message has no key, Kafka distributes it across partitions in a round-robin fashion.

Global ordering across all partitions is not guaranteed.

Messages with the same key are always written to the same partition, ensuring order for that key. This is managed through a hashing function.
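
To make the idea concrete, here is a toy sketch of key-based partitioning. It is not the hash function real clients use (the Java client uses murmur2, and librdkafka ships its own partitioners); it only shows that hashing a key modulo the partition count always yields the same partition for the same key.

package main

import (
    "fmt"
    "hash/fnv"
)

// toyPartition illustrates the idea: hash the key, then take it modulo the
// partition count. Real clients use their own hash functions, so this is
// NOT the exact mapping Kafka produces.
func toyPartition(key string, numPartitions int) int {
    h := fnv.New32a()
    h.Write([]byte(key))
    return int(h.Sum32()) % numPartitions
}

func main() {
    for _, key := range []string{"user-42", "user-42", "user-7"} {
        fmt.Printf("key=%s -> partition %d\n", key, toyPartition(key, 6))
    }
    // The same key always lands on the same partition, which is what
    // preserves per-key ordering.
}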

  • Kafka can scale to 2 million partitions.

  • KRaft (no more ZooKeeper)

    • A quorum is the minimum number of nodes required to make valid decisions across a distributed system.

3. Brokers

Brokers are the servers that store data and handle all data streaming requests.

Each instance of the Kafka server process is a broker.

Each broker stores partitions of topics, allowing Kafka to distribute storage and processing across multiple servers.

  • Kafka cluster: A group of brokers.

In this cluster, each broker is responsible for handling read and write requests from clients.

Historically, Kafka used Apache ZooKeeper for managing metadata and coordinating the brokers.

However, starting with Kafka 4.0, this changed. ZooKeeper was replaced by KRaft, a built-in metadata management system based on the Raft consensus protocol.

This means brokers now handle their own metadata synchronization.

4. Replication

Replication is how data remains durable and fault-tolerant.

Each partition in a topic is copied across multiple brokers to protect against data loss if a broker fails.

This is controlled by a replication factor, which defines the number of copies (replicas). For example, if the replication factor is set to three, Kafka will store three copies of each partition across different brokers.

Among these replicas, one is designated as the leader, while the others are followers.

  • All writes and reads are directed to the leader.
  • The followers continuously sync data from the leader.

If the leader's broker fails, Kafka automatically elects a new leader from the remaining followers, keeping the system running without data loss.

Q. How?

On leader failure, the controller elects a new leader from the in-sync replicas (ISR).

⚠️ A record is durable only if it was replicated to the ISR before the leader failed.

  • Loss case: If the leader acknowledged the record before ISR replication (e.g., acks=1, or acks=all with min.insync.replicas=1/ISR=1) and then failed, the record is permanently lost (the producer won't retry).
  • Retry case: If the leader failed before sending an ack, the producer sees a failure and can retry (use enable.idempotence=true to avoid duplicates). To minimize loss, configure the following (see the sketch after this list):
    • acks=all (the leader acks only after ISR replication)
    • min.insync.replicas >= 2 (at least two replicas must acknowledge)
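
A minimal sketch of a producer configured for these durability settings with confluent-kafka-go. Note that min.insync.replicas is a topic/broker setting, so it cannot be set on the producer itself; the keys shown are standard librdkafka config properties.

package main

import "github.com/confluentinc/confluent-kafka-go/v2/kafka"

func newDurableProducer() (*kafka.Producer, error) {
    // acks=all: the leader only acknowledges once the in-sync replicas have
    // the record. enable.idempotence=true lets retries happen without
    // producing duplicates. min.insync.replicas >= 2 must be set on the
    // topic (or as a broker default), not here on the producer.
    return kafka.NewProducer(&kafka.ConfigMap{
        "bootstrap.servers":  "localhost",
        "acks":               "all",
        "enable.idempotence": true,
    })
}

func main() {
    p, err := newDurableProducer()
    if err != nil {
        panic(err)
    }
    defer p.Close()
}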

By default, only the leader serves reads and writes.

But Kafka also supports follower reads for improved latency:

  • Follower default: replication only (no client reads or writes)
  • With follower reads enabled: clients can read from a follower, which still replicates from the leader

You can configure clients to read from the nearest replica if it reduces network delay.
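
A hedged sketch of what follower reads look like from the client side: the consumer advertises its location via client.rack, and the brokers must be configured with a rack-aware replica selector (replica.selector.class=org.apache.kafka.common.replication.RackAwareReplicaSelector plus broker.rack per broker). The rack id below is a placeholder.

package main

import "github.com/confluentinc/confluent-kafka-go/v2/kafka"

func main() {
    // With rack-aware replica selection enabled on the brokers, a consumer
    // that advertises its rack can be served by the nearest in-sync replica
    // instead of always fetching from the leader.
    c, err := kafka.NewConsumer(&kafka.ConfigMap{
        "bootstrap.servers": "localhost",
        "group.id":          "myGroup",
        "client.rack":       "us-east-1a", // hypothetical rack / availability-zone id
    })
    if err != nil {
        panic(err)
    }
    defer c.Close()
}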

5. Producers

Producers are the client applications responsible for writing data to a Kafka cluster.

While brokers handle storage and replication, producers are what send messages (key-value pairs) into topics.

To configure a producer, you provide a set of properties.

  • bootstrap.servers: A list of brokers that the producer can connect to.
  • acks: The level of acknowledgment required from brokers before considering a message successfully sent.

The main objects in the Producer API:

  1. KafkaProducer: Manages the connection to the cluster and handles message sending.
  2. ProducerRecord: Represents the message (key and value) and its destination topic. It also lets you set optional fields like timestamp, partition, and headers.

Here I'm going to use confluent-kafka-go.

confluent-kafka-go is a lightweight wrapper around librdkafka, a finely tuned C client.

package main

import (
    "fmt"

    "github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {

    p, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": "localhost"})
    if err != nil {
        panic(err)
    }

    defer p.Close()

    // Delivery report handler for produced messages
    go func() {
        for e := range p.Events() {
            switch ev := e.(type) {
            case *kafka.Message:
                if ev.TopicPartition.Error != nil {
                    fmt.Printf("Delivery failed: %v\n", ev.TopicPartition)
                } else {
                    fmt.Printf("Delivered message to %v\n", ev.TopicPartition)
                }
            }
        }
    }()

    // Produce messages to topic (asynchronously)
    topic := "myTopic"
    for _, word := range []string{"Welcome", "to", "the", "Confluent", "Kafka", "Golang", "client"} {
        p.Produce(&kafka.Message{
            TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
            Value:          []byte(word),
        }, nil)
    }

    // Wait for message deliveries before shutting down
    p.Flush(15 * 1000)
}

The producer library handles a lot for you:

  • It chooses which partition to send the message to, either through round-robin or by hashing the message key.
  • It manages retries and acknowledgments, and can guarantee idempotency (no duplicate writes) when enabled.

To Kafka, the data sent by producers is simply bytes, meaning you can use any serialization format.
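
For example, a producer can JSON-encode a struct itself and hand the resulting bytes to Kafka. This is a minimal sketch; the Order struct and the orders topic are made up.

package main

import (
    "encoding/json"

    "github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

// Order is a hypothetical payload type; Kafka never sees this struct,
// only the bytes we hand to the producer.
type Order struct {
    ID     string  `json:"id"`
    Amount float64 `json:"amount"`
}

func main() {
    p, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": "localhost"})
    if err != nil {
        panic(err)
    }
    defer p.Close()

    value, err := json.Marshal(Order{ID: "o-123", Amount: 42.5})
    if err != nil {
        panic(err)
    }

    topic := "orders"
    err = p.Produce(&kafka.Message{
        TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
        Key:            []byte("o-123"), // same key -> same partition -> ordered per order
        Value:          value,
    }, nil)
    if err != nil {
        panic(err)
    }
    p.Flush(15 * 1000)
}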

The producer is where the partition logic happens. When you send a message, the producer library determines its partition based on its key or by round-robin if no key is provided. This is critical for maintaining ordering and load distribution across the cluster.

6. Consumers

Consumers are the client applications responsible for reading data from Kafka topics.

Consumers retrieve data from topics, process it, and often send it onward for additional processing or storage.

Any application that pulls data from Kafka—whether for analytics, database updates, or real-time processing—is a consumer.

  • bootstrap.servers: A list of broker addresses to connect to. You don't need to list every broker; Kafka fetches the full cluster metadata automatically from the brokers you specify.
  • group.id: A unique id for the consumer group.

The consumer subscribes to one or more topics.

  • You can use a regular expression to subscribe to multiple topics dynamically.
  • After subscribing, the consumer enters an infinite loop, continuously polling for new messages.
  • The consumer fetches ConsumerRecords, each containing:
    • Key: The unique identifier of the message.
    • Value: The data payload.
    • Partition: The partition the message came from.
    • Timestamp: When the event was recorded.
    • Headers: Optional metadata.
Example consumer using confluent-kafka-go:
package main

import (
    "fmt"
    "time"

    "github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {

    c, err := kafka.NewConsumer(&kafka.ConfigMap{
        "bootstrap.servers": "localhost",
        "group.id":          "myGroup",
        "auto.offset.reset": "earliest",
    })

    if err != nil {
        panic(err)
    }

    err = c.SubscribeTopics([]string{"myTopic", "^aRegex.*[Tt]opic"}, nil)

    if err != nil {
        panic(err)
    }

    // A signal handler or similar could be used to set this to false to break the loop.
    run := true

    for run {
        msg, err := c.ReadMessage(time.Second)
        if err == nil {
            fmt.Printf("Message on %s: %s\n", msg.TopicPartition, string(msg.Value))
        } else if !err.(kafka.Error).IsTimeout() {
            // The client will automatically try to recover from all errors.
            // Timeout is not considered an error because it is raised by
            // ReadMessage in absence of messages.
            fmt.Printf("Consumer error: %v (%v)\n", err, msg)
        }
    }

    c.Close()
}

Kafka consumer reads are non-destructive. Messages stay in the log until they expire by retention policy.

Kafka's log-based design means consuming data does not delete it. Multiple consumers—even in different consumer groups—can read the same messages independently.

This allows separate applications to process the same event streams without conflict.

Offset tracking

Kafka tracks, per consumer group, the offset of the last consumed message in each partition.

This is called offset tracking and ensures that if a consumer goes offline, it can resume from where it left off.
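
By default the client auto-commits offsets in the background. If you want to commit only after a message has actually been processed, you can disable auto-commit and use CommitMessage, as in this sketch (at-least-once processing):

package main

import (
    "fmt"
    "time"

    "github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
    c, err := kafka.NewConsumer(&kafka.ConfigMap{
        "bootstrap.servers":  "localhost",
        "group.id":           "myGroup",
        "auto.offset.reset":  "earliest",
        "enable.auto.commit": false, // we commit manually after processing
    })
    if err != nil {
        panic(err)
    }
    defer c.Close()

    if err := c.SubscribeTopics([]string{"myTopic"}, nil); err != nil {
        panic(err)
    }

    for {
        msg, err := c.ReadMessage(time.Second)
        if err != nil {
            continue // timeouts and transient errors: just poll again
        }
        fmt.Printf("processing %s\n", string(msg.Value))

        // Commit only after successful processing, so a crash before this
        // point causes the message to be re-delivered (at-least-once).
        if _, err := c.CommitMessage(msg); err != nil {
            fmt.Printf("commit failed: %v\n", err)
        }
    }
}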

Consumer Groups and Parallelism

To scale processing, Kafka supports consumer groups.

All consumers in the same group share the work of reading from a topic's partitions.

  • Kafka assigns each partition to one consumer in the group
  • No two consumers in the same group read from the same partition
Partition <-> Consumer (within one consumer group)
1:1 ✅ (one consumer per partition)
1:N ❌ (a partition is never shared by consumers in the same group)
N:1 ✅ (one consumer can read several partitions)
  • But the same partition can be read by consumers in different groups (each group tracks its own offsets)
Partition <-> Consumer (across M consumer groups)
1:M ✅
N:1 ✅
N:M ✅

For example:

  • If a topic has three partitions and three consumers in the group, each consumer processes messages from one partition independently.
  • If more consumers are added, Kafka automatically rebalances the partitions across them.

  • If a consumer fails, its partitions are reassigned to the remaining consumers to maintain processing.

7. Confluent Schema Registry

In Kafka, producers and consumers exchange structured data defined by schemas. As schemas evolve (e.g., new fields are added or types change), compatibility becomes critical to avoid breaking applications.

The Confluent Schema Registry manages these schemas centrally, enforcing compatibility rules and ensuring safe evolution.

Q. Why does it matter?
Schema Registry allows producers and consumers to evolve independently:

  1. Producers can upgrade first, relying on forward compatibility, so older consumers continue to process messages.
  2. Consumers can upgrade or rollback independently, relying on backward compatibility, so they still read messages written by older producers.
  3. If a deployment fails, rollback is safe: compatibility rules ensure messages are still valid, and consumers use schema IDs in messages to fetch the right schema version.

Compatibility Settings

Schema Registry enforces compatibility at the subject level (subjects are usually derived from topic names):

  • BACKWARD -> Safe for consumer-first upgrades or rollback of consumers.
  • FORWARD -> Safe for producer-first upgrades or rollback of producers.
  • FULL -> Safe for mixed deployments; both forward and backward compatibility guaranteed.
  • Transitive modes (BACKWARD_TRANSITIVE, FORWARD_TRANSITIVE, FULL_TRANSITIVE) ensure the entire schema history remains compatible.

Safe Strategies for Evolution

  • Schema IDs per message: Each message carries its schema ID, so consumers can always fetch the exact version and deserialize safely. This enables mixed message versions in a single topic.
  • Topic versioning: Some teams enforce schema version per topic (e.g., orders-v1, orders-v2) for stricter control, making rollback as simple as switching producers/consumers back to the previous topic.

Serialization Formats

Schema Registry supports Avro, JSON Schema, and Protobuf. Producers register schemas, include schema IDs in messages, and consumers fetch the correct schema version to deserialize.
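
Under the hood, Confluent's wire format prefixes each payload with a magic byte (0) and the 4-byte, big-endian schema ID. Here is a small sketch that only parses this framing; it does not deserialize the Avro/JSON/Protobuf payload, and the example value is fabricated.

package main

import (
    "encoding/binary"
    "errors"
    "fmt"
)

// schemaID extracts the Schema Registry ID from a value written with
// Confluent's wire format: [0x00 magic byte][4-byte big-endian schema ID][payload].
func schemaID(value []byte) (int32, []byte, error) {
    if len(value) < 5 || value[0] != 0 {
        return 0, nil, errors.New("not Confluent wire format")
    }
    id := int32(binary.BigEndian.Uint32(value[1:5]))
    return id, value[5:], nil
}

func main() {
    // A fabricated example value: magic byte 0, schema ID 42, then payload bytes.
    value := append([]byte{0, 0, 0, 0, 42}, []byte("...avro bytes...")...)
    id, payload, err := schemaID(value)
    if err != nil {
        panic(err)
    }
    fmt.Printf("schema id=%d, payload=%d bytes\n", id, len(payload))
}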

8. Kafka Connect

Kafka Connect makes Kafka extensible.

  1. Source connectors (external system -source connector-> Kafka)
  2. Sink connectors (Kafka -sink connector-> external system)

Rather than writing custom code to sync data from a source to a sink, you can use plug-and-play connectors that handle it for you (see the sketch at the end of this section). No need to reinvent the wheel.

  • Stateless architecture: Each connector task processes records independently without keeping session state. This design makes connectors horizontally scalable and fault-tolerant, since tasks can be restarted or rebalanced without complex recovery.
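
Connectors are typically registered by POSTing a JSON config to the Kafka Connect REST API (port 8083 by default). A sketch using the FileStreamSource connector that ships with Kafka; the connector name, file path, and topic are placeholders.

package main

import (
    "bytes"
    "fmt"
    "net/http"
)

func main() {
    // Minimal config for the FileStreamSource connector bundled with Kafka:
    // tail a file and produce each line to a topic.
    body := []byte(`{
      "name": "file-source-demo",
      "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "file": "/tmp/input.txt",
        "topic": "file-lines"
      }
    }`)

    resp, err := http.Post("http://localhost:8083/connectors", "application/json", bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("Connect replied:", resp.Status)
}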

9. Stream Processing

Stream processing frameworks are purpose-built to manage state, ordering, and fault tolerance at scale.

Advanced stream processors allow you to perform a wide range of tasks including:

  1. Windowed aggregations (e.g., count pageviews per minute)
  2. Stream-to-table joins (e.g., enrich clickstream data with user profiles)
  3. Event filtering and deduplication
  4. Resilience to late or out-of-order data
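
To see what these frameworks save you from, here is a deliberately naive, in-memory sketch of task 1 (pageviews per minute) built directly on the consumer API. It has no persistent state, no fault tolerance, and no late-event handling, which is exactly what Flink or Kafka Streams provide. The topic and group names are made up.

package main

import (
    "fmt"
    "time"

    "github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
    c, err := kafka.NewConsumer(&kafka.ConfigMap{
        "bootstrap.servers": "localhost",
        "group.id":          "pageview-counter", // hypothetical group
        "auto.offset.reset": "earliest",
    })
    if err != nil {
        panic(err)
    }
    defer c.Close()

    if err := c.SubscribeTopics([]string{"pageviews"}, nil); err != nil { // hypothetical topic
        panic(err)
    }

    // counts holds pageviews per (page key, minute) tumbling window, in memory only.
    counts := map[string]int{}

    for {
        msg, err := c.ReadMessage(time.Second)
        if err != nil {
            continue
        }
        window := msg.Timestamp.Truncate(time.Minute)
        key := fmt.Sprintf("%s@%s", string(msg.Key), window.Format(time.RFC3339))
        counts[key]++
        fmt.Printf("%s -> %d views\n", key, counts[key])
    }
}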

Two Powerful Tools

Flink and Kafka Streams

1) Apache Flink (the de facto standard) – a distributed engine built for large-scale stream (and batch) processing. It supports SQL-based queries, event-time processing, and high-throughput aggregations.

2) Kafka Streams (Java ecosystem) – a lightweight Java library that turns your Kafka consumers into stream processors. It's embedded, easy to deploy, and great for microservices.

