For years, Apache Kafka relied on Apache ZooKeeper for cluster metadata management, controller election, and broker coordination. ZooKeeper worked — but it also introduced operational complexity, scaling bottlenecks, split-brain risks, and an additional distributed system that operators had to understand deeply.
With the introduction of KRaft (Kafka Raft mode), Kafka removed ZooKeeper entirely and replaced it with a native consensus layer built directly into Kafka brokers using the Raft protocol.
This wasn’t just a feature update.
It was a fundamental architectural rewrite.
This blog is a deep technical exploration of:
Why ZooKeeper became a bottleneck
How KRaft works internally
What changed in metadata management
How controller quorum operates
Failure handling mechanics
Performance implications
Migration strategies
Operational tradeoffs
Production pitfalls
If you’re running Kafka at scale — or planning to — understanding KRaft is no longer optional.
- The ZooKeeper Era: Why It Had to Go

Before KRaft, Kafka used ZooKeeper for:
Broker registration
Controller election
Topic metadata storage
ACL storage
ISR (In-Sync Replica) tracking
The Hidden Complexity
ZooKeeper introduced several systemic issues:
1️⃣ Dual Distributed Systems
You weren’t running one distributed system.
You were running two:
Kafka cluster
ZooKeeper ensemble
Both required:
Independent scaling
Monitoring
Tuning
Backup strategies
2️⃣ Metadata Bottlenecks
ZooKeeper was not designed for:
Massive metadata churn
Large partition counts (100k+)
High-frequency controller updates
As Kafka clusters scaled to hundreds of thousands of partitions, ZooKeeper began to struggle.
3️⃣ Controller Instability
Controller election relied on ephemeral znodes.
Under high load or GC pauses:
Session expirations triggered false elections
Controllers flapped
Rebalances cascaded
Large clusters would experience “controller storms.”
4️⃣ Scaling Ceiling
ZooKeeper’s architecture limited metadata scalability because:
All metadata lived outside Kafka
Writes required ZooKeeper quorum
Metadata propagation depended on watchers
Eventually, Kafka’s data plane outgrew its control plane.
- Enter KRaft: Kafka’s Native Consensus Layer

KRaft replaces ZooKeeper with:
A Raft-based metadata quorum embedded inside Kafka.
Instead of external coordination, Kafka brokers now manage metadata themselves via an internal replicated log.
The system consists of:
Controller quorum nodes
Metadata log
Broker nodes
Raft consensus mechanism
This means Kafka now manages:
Topic creation
Partition assignments
ACL updates
ISR changes
Broker registrations
Internally. Natively.
- The Metadata Log: Kafka’s Brain

The core innovation in KRaft is the metadata log.
Instead of storing cluster state in ZooKeeper, Kafka now:
Stores metadata changes as log records
Replicates them via Raft
Applies them deterministically
This is similar to how partitions store data records — but for metadata.
Every change, for example:
Create topic
Delete topic
Add partition
Change replication factor
Broker joins
is written as an append-only metadata record.
Why This Is Powerful
1️⃣ Deterministic State Reconstruction
A new controller can reconstruct cluster state by replaying the metadata log.
No ZooKeeper snapshot sync required.
2️⃣ Linearizable Writes
Raft guarantees:
Leader-based ordering
Majority acknowledgment
Strong consistency
This eliminates stale metadata issues.
3️⃣ Scalability
Metadata scales like Kafka logs:
Append-only
Replicated
Log-compacted
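The idea of deterministic state reconstruction from an append-only log can be sketched in a few lines of Python. This is a toy model, not Kafka’s actual record schema — the record types and state shape here are invented for illustration:

```python
# Toy model of rebuilding cluster state by replaying an append-only
# metadata log. Record names are illustrative, not Kafka's real schema.

def apply_record(state, record):
    """Apply one metadata record to the in-memory cluster state."""
    kind = record["type"]
    if kind == "register_broker":
        state["brokers"].add(record["id"])
    elif kind == "create_topic":
        state["topics"][record["name"]] = {"partitions": record["partitions"]}
    elif kind == "delete_topic":
        state["topics"].pop(record["name"], None)
    return state

def replay(log):
    """Rebuild the full state by replaying every record from offset 0."""
    state = {"topics": {}, "brokers": set()}
    for record in log:
        state = apply_record(state, record)
    return state

log = [
    {"type": "register_broker", "id": 1},
    {"type": "create_topic", "name": "orders", "partitions": 6},
    {"type": "create_topic", "name": "tmp", "partitions": 1},
    {"type": "delete_topic", "name": "tmp"},
]
state = replay(log)
print(state)  # {'topics': {'orders': {'partitions': 6}}, 'brokers': {1}}
```

Because every controller applies the same records in the same order, any replica that replays the log arrives at the same state — which is exactly what lets a new controller take over without an external snapshot sync.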
- The Controller Quorum

In KRaft, some nodes act as:
Controller quorum voters
These nodes:
Participate in Raft
Elect a metadata leader
Replicate metadata log
You can run:
Dedicated controller nodes
Or combined broker + controller nodes
Production recommendation for large clusters:
Use dedicated controllers (3 or 5 nodes).
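As a rough sketch, a dedicated controller’s configuration might look like the following. `process.roles` and `controller.quorum.voters` are the actual KRaft settings; the hostnames, ports, and paths are placeholders:

```properties
# server.properties for a dedicated KRaft controller (node 1 of 3)
process.roles=controller
node.id=1
controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
listeners=CONTROLLER://ctrl1:9093
controller.listener.names=CONTROLLER
log.dirs=/var/kafka/metadata
```

A combined node would instead set `process.roles=broker,controller`, which is convenient for small or test clusters but mixes metadata and data traffic on the same JVM.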
Raft Basics in Kafka
Raft ensures:
Leader election
Log replication
Consistency guarantees
When the leader fails:
Followers elect a new leader
Metadata operations continue
No external system required
This is different from ZooKeeper’s ephemeral node model.
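The election rule itself is simple majority arithmetic. The sketch below is a deliberate simplification — real Raft also compares term numbers and log completeness before a voter grants its vote — but it captures the core condition:

```python
# Toy Raft-style election rule: a candidate becomes leader only if a
# strict majority of the configured voter set grants it a vote.
# Simplified: real Raft also checks terms and log completeness.

def can_elect_leader(voters, alive):
    """True if the reachable voters form a strict majority of the set."""
    return len(alive & set(voters)) > len(voters) // 2

voters = [1, 2, 3]
print(can_elect_leader(voters, alive={1, 2}))  # True: 2 of 3 is a majority
print(can_elect_leader(voters, alive={1}))     # False: no quorum, no leader
```

This is why 3- or 5-node quorums are recommended: a 3-node quorum tolerates one voter failure, a 5-node quorum tolerates two.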
- Failure Handling Deep Dive

Let’s examine critical scenarios.
Scenario 1: Controller Leader Crash
What happens?
Followers detect missed heartbeats
Election timeout triggers
New leader elected
Metadata operations resume
Since the metadata log is replicated, no state is lost (assuming a quorum survives).
Scenario 2: Broker Crash
A broker’s registration lives in the metadata log.
When a broker dies:
Controller marks the broker offline
Partition leadership is reassigned
ISR updates occur
Metadata change logged
Everything flows through Raft.
Scenario 3: Network Partition
If the quorum is lost:
Metadata writes stop.
Brokers keep serving with their last known metadata, but no new controller decisions can be committed.
This is correct behavior:
Better to pause than split-brain.
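The reason a majority quorum prevents split-brain is purely combinatorial: however a voter set is partitioned into two sides, at most one side can hold a strict majority. A small exhaustive check over a hypothetical 5-voter quorum:

```python
from itertools import combinations

# However you split a voter set into two disjoint sides, the two sides
# can never BOTH hold a strict majority -- so two leaders cannot both
# commit metadata writes during a network partition.

def has_quorum(side, total):
    """A side has quorum only with a strict majority of all voters."""
    return len(side) > total // 2

voters = {1, 2, 3, 4, 5}
for k in range(len(voters) + 1):
    for side_a in combinations(voters, k):
        side_b = voters - set(side_a)
        # Two disjoint majorities of the same set are impossible.
        assert not (has_quorum(side_a, len(voters))
                    and has_quorum(side_b, len(voters)))
print("no partition of 5 voters yields two quorums")
```

The minority side simply stops accepting metadata writes until the partition heals, which is the “pause rather than split-brain” behavior described above.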
- Performance Improvements with KRaft

ZooKeeper mode had bottlenecks:
Metadata propagation latency
Controller failover time
Partition scaling limits
KRaft improves:
Faster Controller Failover
ZooKeeper failover: seconds
KRaft failover: sub-second (in optimized setups)
Higher Partition Scalability
Kafka can now scale beyond a million partitions per cluster, at least in theory.
Lower Metadata Latency
Metadata updates no longer depend on ZooKeeper watchers.
- Architectural Changes in Brokers

In ZooKeeper mode:
Broker startup:
Connect to ZooKeeper
Register ephemeral node
Fetch metadata
Wait for controller
In KRaft:
Broker startup:
Connect to controller quorum
Fetch metadata snapshot
Start replication
Simpler pipeline. Fewer moving parts.
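In KRaft mode, a broker’s very first start is preceded by formatting its storage with a cluster ID. The commands below follow the Kafka quickstart; the paths and properties file name will vary by installation:

```shell
# Generate a cluster ID and format storage before the first start.
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t "$KAFKA_CLUSTER_ID" -c config/kraft/server.properties

# Then start the node; it connects to the quorum named in
# controller.quorum.voters and fetches its metadata snapshot.
bin/kafka-server-start.sh config/kraft/server.properties
```

Note there is no ZooKeeper ensemble to stand up first — the storage format step is the only new prerequisite.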
- Migration from ZooKeeper to KRaft

Migration path includes:
Upgrade Kafka version
Migrate metadata to KRaft format
Remove ZooKeeper dependency
Reconfigure brokers
Key concerns:
Downtime window
Metadata integrity
Compatibility mode
Kafka provides migration tooling — but this is not trivial in large clusters.
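As one sketch of what the tooling looks like: the bridge-mode migration (defined in KIP-866, available from Kafka 3.4+) has the new KRaft controllers temporarily talk to both systems. Hostnames and ports below are placeholders:

```properties
# On the new KRaft controllers during the bridge-mode migration.
zookeeper.metadata.migration.enable=true
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
controller.quorum.voters=1@ctrl1:9093,2@ctrl2:9093,3@ctrl3:9093
```

The controllers copy metadata out of ZooKeeper into the metadata log, brokers are then restarted in KRaft mode, and only once everything is verified is the ZooKeeper configuration removed. Treat each phase as a checkpoint, not a single switch.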
- Operational Considerations

KRaft simplifies architecture — but introduces new responsibilities.
Controller Sizing
Controllers now handle:
All metadata traffic
All partition leadership decisions
All topic mutations
Under-provisioned controllers → cluster instability.
Metadata Log Growth
Large clusters generate:
Millions of metadata records
Log compaction and snapshotting must be tuned.
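The relevant knobs live under the `metadata.*` prefix in the server properties. The names below are the KRaft settings as of recent Kafka versions; the values are illustrative starting points, not recommendations — check the defaults for your version before changing them:

```properties
# Metadata log snapshot and retention tuning (values illustrative).
metadata.log.segment.bytes=1073741824
# Take a new snapshot after this many bytes of new metadata records.
metadata.log.max.record.bytes.between.snapshots=20971520
# Bound how much old metadata log is retained.
metadata.max.retention.bytes=104857600
```

On clusters with heavy topic churn, too-infrequent snapshots inflate controller restart times, since more of the log must be replayed on top of the last snapshot.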
Monitoring Must Evolve
New metrics to track:
Controller quorum lag
Metadata log replication latency
Election rates
Follower sync state
- Tradeoffs: Is KRaft Always Better?

While KRaft removes ZooKeeper complexity, it introduces:
New operational patterns
Raft tuning needs
Quorum capacity planning
ZooKeeper mode was battle-tested over more than a decade.
KRaft is the future, but it is still maturing in very large-scale production environments.
- When Should You Move to KRaft?

Move if:
Starting new cluster
Want simplified architecture
Scaling beyond 100k partitions
Reducing operational overhead
Wait if:
Running ultra-critical stable cluster
Lacking operational maturity
Using legacy tooling dependent on ZooKeeper
- Real-World Lessons from Large Deployments

Clusters with:
500k+ partitions
10k+ topics
Multi-tenant workloads
Observed:
40–60% faster metadata propagation
Reduced controller instability
Lower operational toil
But also:
Misconfigured quorum size caused outages
Controller CPU saturation under topic churn
KRaft simplifies — but does not eliminate complexity.
- The Bigger Picture: Kafka as a Self-Contained System

By removing ZooKeeper, Kafka becomes:
Self-governing
Self-coordinating
Fully log-driven
The control plane and data plane now share the same design philosophy:
Append-only logs
Replicated state
Deterministic replay
This architectural consistency is elegant — and powerful.
- Future Implications

KRaft enables:
Faster metadata scaling
Tiered storage evolution
Better cloud-native integration
Cleaner multi-region replication
It positions Kafka as a fully independent distributed database for events.
Final Thoughts
KRaft is not just a ZooKeeper replacement.
It is a redefinition of Kafka’s control plane.
By embedding Raft-based consensus directly into Kafka:
Metadata becomes first-class
Failover becomes deterministic
Scaling ceiling increases dramatically
For operators, this means:
Less external dependency.
More internal understanding required.
Kafka has always been a distributed log.
With KRaft, it became a fully self-contained distributed system.



