DEV Community

John Preston for AWS Community Builders

Posted on

Kafka Connect Watcher - actively monitor your clusters

TL;DR

Kafka Connect Watcher is a service that will actively monitor your connect clusters & its connectors, allow you to define automated actions to take and notify you when problems occur

Intro

Running Kafka connect clusters, is not particularly difficult, because it's been very well designed. It is orders of magnitude harder to manage the Kafka cluster(s). Furthermore, in 2023, there are several ways to deploy the workers & connectors: VMs, containers, and even managed services such as MSK Connect.

I love managed services, so I eventually looked into MSK Connect. However, at the time of writing, MSK Connect is A) very expensive B) is very limited in connection options.

"But mate, if it's easy to run a cluster, what's the issue?" I hear you ask. Well, let's dive into it, shall we?

The why

Let's start with a little bit of background: over the past few years, I have had the responsibility to deploy and maintain several connect clusters, and implemented CICD pipelines for developers to deploy their connectors.

Running the connect cluster has an infrastructure cost: run a few containers on ECS + Fargate (not going to get into Kafka costs, that's another issue on its own), and really that's where the complications can stop on the cluster. A connect cluster will run without any maintenance required.

Connectors however, that's a different story. Especially if, like me, to reduce cost and avoid having to re-invent the wheel, you come up with the idea to host a connect cluster and offer "connectors as a service" to your application teams.

The connect framework on its own offers, to my knowledge, only one way to extract monitoring metrics on the health of the connectors: JVM exporter. If you refer to this blog post which details how I achieved this, you will see that there is, technically, an easy way to collect data points on the health of your connectors.

However, this won't give you information such as, why it failed. And unless you go down the rabbit hole of a rather evolved logic, getting this information can be time consuming.

And sometimes, all a connector will take to work again, is to pause, restart the tasks, and resume it.

So to solve these things, I decided to write Kafka Connect Watcher

The How

Originally, I thought as a savy AWS engineer, I would use my existing CloudWatch metrics, notify a Lambda function, parse the alarm payload, and from there attempt connecting to the connect cluster. But then again, there are lots of information you'd need that the alarm in CloudWatch does not give you.

Some Kafka vendors, will tie you into their ecosystem tooling (yes, that includes AWS into that one). It's great if you are buying into it, or their supported partners, but the options of actions are limited and up to you to implement.

Ans so, that was the important thing to me, Kafka Connect is an open source framework, that works wherever you want it to run. So in that spirit, I decided instead to write a simple micro-service, in Python, for anyone to use and contribute to.

Now, you guessed it already, this service is AWS integrated and as it currently stands, will send notifications to SNS, report metrics to CloudWatch, but there will definitely be room for other integrations (webhooks are very versatile!).

How does it work?

Why in Python I hear you ask? Well, I had to write a python client library for Kafka connect to enable automation. So I decided to re-use that.

The service takes a configuration file that details

  • the connect clusters to monitor
    • the connectors to monitor, using regular expressions to include/exclude specific connectors
    • the actions to perform when a connector is not RUNNING
    • the notifications to send when a connector activity occurs
  • The monitoring to enable and its respective settings

The connect watcher can monitor more than one connect cluster at a time, as to save on deployments if your architecture and configuration allows for it.

The configuration file uses a JSON schema for both input validation and help with documentation of the required fields, which I hope will help users.

Conclusion

This is a fun project which I enjoy doing and will happily work on implementing new features, so please feel free to open a Feature Request or submit your code changes.

Top comments (0)