Hey there, welcome back to the second episode of the monthly misadventures of a regular dev. Thanks for sticking around, especially after how rough the last one was. Let's get straight to the point: have you ever had your organization look at the month-over-month Confluent or Kinesis bills and think, "Why can't we just self-host this again?" You tried explaining that scaling and meeting downtime SLAs would be tough, especially with a lean team, but those warnings fell on deaf ears. Now you're facing the consequences: infrequent outages on your self-hosted Kafka cluster, the kind that always seem to land on a weekend. Don't worry, I've been there too. In this post, I'll share how I tackled this challenge. The implementation details? They're reserved for an upcoming blog.
What’s the issue:
Our self-hosted Kafka cluster has experienced intermittent downtimes caused by corrupted log files. Although these failures are infrequent, they require manual developer intervention to resolve—typically SSHing into the affected node, purging corrupted data, and restarting the broker.
Proposed Solution:
We plan to add a sidecar to each Kafka node that exposes an API for health checks, purging, starting, stopping, and restarting the broker. This sidecar abstracts away the complexity of the underlying infrastructure and Kafka internals, giving developers a simple way to interact with each node.
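To make the sidecar idea more concrete, here's a minimal sketch of what such an API could look like, assuming a small Flask app sitting next to a systemd-managed broker. The service name, data directory, and port are placeholders for illustration, not our actual setup, and a real purge would be far more careful than blowing away the whole log directory.

```python
# Hypothetical sidecar sketch: a tiny Flask app exposing broker lifecycle
# endpoints. Service name, data path, and port are placeholders.
import shutil
import subprocess

from flask import Flask, jsonify

app = Flask(__name__)

KAFKA_SERVICE = "kafka"                # assumed systemd unit name
KAFKA_LOG_DIR = "/var/lib/kafka/data"  # assumed broker log directory

def systemctl(action: str) -> bool:
    """Run `systemctl <action> kafka` and report whether it succeeded."""
    result = subprocess.run(["systemctl", action, KAFKA_SERVICE])
    return result.returncode == 0

@app.get("/health")
def health():
    # "Healthy" here just means the systemd unit is active; a real check
    # might also probe the broker port or JMX metrics.
    ok = subprocess.run(
        ["systemctl", "is-active", "--quiet", KAFKA_SERVICE]
    ).returncode == 0
    return jsonify(healthy=ok), (200 if ok else 503)

@app.post("/stop")
def stop():
    return jsonify(ok=systemctl("stop"))

@app.post("/start")
def start():
    return jsonify(ok=systemctl("start"))

@app.post("/restart")
def restart():
    return jsonify(ok=systemctl("restart"))

@app.post("/purge")
def purge():
    # Stop the broker, wipe its (possibly corrupted) log directory, restart.
    systemctl("stop")
    shutil.rmtree(KAFKA_LOG_DIR, ignore_errors=True)
    return jsonify(ok=systemctl("start"))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8090)  # assumed sidecar port
```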
The second component is a centralized controller, hosted separately. It polls the health status of each node every 5 seconds. If a broker becomes unresponsive, the controller triggers an automated purge-and-restart sequence, retrying up to three times. If the broker remains unavailable, a cluster-wide purge and restart is attempted. Should that fail, an infrastructure operations alert is sent, tagging the tech team to intervene manually.
Although failures are rare (in over 1.5 years of managing our Kafka cluster, a purge and restart has always resolved availability issues), we included these escalation steps to prepare for unexpected scenarios. Besides polling, the controller also offers an API for cluster-wide operations, exposing the same commands as the sidecars.
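And here's a rough sketch of the controller's polling and escalation loop, again with hypothetical node addresses, timeouts, and a stubbed-out alert call, just to illustrate the retry-then-escalate flow described above.

```python
# Hypothetical controller sketch: poll each sidecar, escalate on failure.
# Node addresses, timeouts, and the alert call are placeholders.
import time

import requests

NODES = ["http://broker-1:8090", "http://broker-2:8090", "http://broker-3:8090"]
POLL_INTERVAL_SECONDS = 5
MAX_NODE_RETRIES = 3

def is_healthy(node: str) -> bool:
    try:
        return requests.get(f"{node}/health", timeout=2).status_code == 200
    except requests.RequestException:
        return False

def purge_and_restart(node: str) -> bool:
    try:
        requests.post(f"{node}/purge", timeout=30)
    except requests.RequestException:
        return False
    return is_healthy(node)

def alert_ops(message: str) -> None:
    # Placeholder: in practice this would page on-call / tag the tech team.
    print(f"ALERT: {message}")

def handle_unhealthy(node: str) -> None:
    # Step 1: purge and restart the affected broker, up to three times.
    for _ in range(MAX_NODE_RETRIES):
        if purge_and_restart(node):
            return
    # Step 2: fall back to a cluster-wide purge and restart.
    for other in NODES:
        purge_and_restart(other)
    if all(is_healthy(n) for n in NODES):
        return
    # Step 3: give up and page a human.
    alert_ops(f"{node} still unhealthy after node and cluster-wide recovery")

if __name__ == "__main__":
    while True:
        for node in NODES:
            if not is_healthy(node):
                handle_unhealthy(node)
        time.sleep(POLL_INTERVAL_SECONDS)
```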
Limitations:
This solution effectively monitors and maintains broker health but currently lacks dynamic load management. We run a fixed number of brokers and controllers and do not support automatically adding or removing nodes based on load. However, in our production environment, we remain well within available compute capacity.
Another limitation is visibility. While the system operates reliably, it lacks monitoring tools such as connection failure metrics, utilization stats, or a dashboard; interactions are limited to the API, with no UI available. As they say, if you can't validate or monitor that it's running smoothly, it isn't running smoothly enough.
Future Scope:
Future improvements will focus on adding a visibility dashboard with metrics, a user interface for cluster interaction, and support for dynamic cluster resizing via API and UI.