How AppSignal Monitors Their Own Kafka Brokers

#devops #webperf #architecture #webdev

Today, we dip our toes into collecting custom metrics with a standalone agent. We'll be taking our own Kafka brokers and using the StatsD protocol to get their metrics into AppSignal. This post is for those who have some experience with monitoring tools and want to take monitoring to every corner of their architecture, or want to add their own metrics to their monitoring setup.

Why the Standalone Agent

AppSignal for Ruby, Elixir, and Node.js gives you all-in-one monitoring out of the box: errors, performance, host metrics, custom metrics, and dashboards.

By having all relevant metrics collected and sent to AppSignal, all the monitoring data is in one place. That is one of the things we think is crucial about good monitoring:

You need to have errors, performance and host metrics in one place, and you need as much of your architecture monitored like that in one place.

That way you can triage from all angles. For example, you might spot a performance issue in a background job, then see it is caused by a long wait on the DB server, and then find out that the wait was caused by network issues on the host the database runs on.

This works out-of-the-box in a Ruby, Elixir or Node.js app. If you have part of your architecture run on systems without any Ruby, Elixir or Node.js, you can add their metrics to AppSignal as well using the standalone agent. For example, when you have certain workers do things in Rust, or when you have a standalone database server.

How We Do In-house Monitoring

We use this setup to monitor our Kafka brokers.

One of the things we need to keep a close eye on is available retention in Kafka. There are some hard limits on disk space and we need to make sure we can achieve the required retention. Otherwise, we need to expand our cluster.

Because the requests we monitor for our customers can carry a lot of metadata, Kafka's topics can get huge. To make sure things don't go wrong because a host's disk fills up, we've sized our Kafka setup by disk usage, which we monitor as well. 😉

Disk usage is easy to monitor because the standalone AppSignal agent will automatically send host metrics to AppSignal right out of the box.

You can actually see in this 30-day resolution graph when we cleaned up that disk, way before things went wrong 😁

That leaves retention. Retention is not available in Kafka's standard metrics, so we had to find another solution. We monitor retention by looking at the first indexed timestamp of each topic/partition on the broker itself. If the retention that timestamp implies is shorter than a set number of hours or days, we'd like to be notified, as that might impact our customers during an outage.
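As a rough sketch of that check, the two values could be computed along these lines. The function names, the seconds-based timestamps, and the `required_secs` threshold are assumptions for illustration, not Watchman's actual code.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Current Unix time in seconds (assumed unit for this sketch).
fn now_timestamp() -> i64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs() as i64)
        .unwrap_or(0)
}

/// Hours of data currently retained, given the first indexed timestamp
/// (in seconds) of a partition.
fn retention_hours(now: i64, first_timestamp: i64) -> f32 {
    (now - first_timestamp) as f32 / 3600.0
}

/// Milliseconds by which the retained window falls short of the required
/// retention; 0 when there is enough retention.
fn retention_short_ms(now: i64, first_timestamp: i64, required_secs: i64) -> i64 {
    let retained_secs = now - first_timestamp;
    (required_secs - retained_secs).max(0) * 1000
}

fn main() {
    let now = now_timestamp();
    // Hypothetical partition whose oldest message is 30 minutes old,
    // checked against a required retention of one hour.
    let first_timestamp = now - 1800;
    println!("hours: {}", retention_hours(now, first_timestamp));
    println!("short: {} ms", retention_short_ms(now, first_timestamp, 3600));
}
```

A value above zero from the second function means the partition holds less history than required, which is the condition worth alerting on.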

We do so using our Kafka monitoring tool, called Watchman. It is written in Rust and uses a (very simple) StatsD client to send data to the standalone AppSignal agent running on each broker.

In our Watchman process, we send these retention metrics to the AppSignal agent, which listens for StatsD messages:

    statsd_client.send_gauge("topic_retention_hours", (now_timestamp() - timestamp) as f32 / 3600.0, &format!("topic:{},partition:{}", topic, partition))?;

And:

    statsd_client.send_gauge("topic_retention_short", ms_short, &format!("topic:{},partition:{}", topic, partition))?;
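For readers who want to try something similar, here is a minimal sketch of what a very simple StatsD client with a `send_gauge` method could look like. It is not our actual client: the struct, the `format_gauge` helper, and the `|#tag:value` tag suffix (the DogStatsD-style convention) are assumptions, as is the agent listening on localhost. Check the standalone agent documentation for the address and tag format your setup expects.

```rust
use std::io;
use std::net::UdpSocket;

/// Hypothetical minimal StatsD client, for illustration only.
pub struct StatsdClient {
    socket: UdpSocket,
    target: String,
}

/// Builds one StatsD gauge line: `name:value|g`, with an optional
/// `|#tags` suffix (assumed DogStatsD-style tag format).
pub fn format_gauge(name: &str, value: f32, tags: &str) -> String {
    if tags.is_empty() {
        format!("{}:{}|g", name, value)
    } else {
        format!("{}:{}|g|#{}", name, value, tags)
    }
}

impl StatsdClient {
    /// `target` is the agent's StatsD address, e.g. "127.0.0.1:8125"
    /// (8125 is the conventional StatsD port; adjust to your setup).
    pub fn new(target: &str) -> io::Result<Self> {
        // Bind to an ephemeral local port; StatsD is fire-and-forget over UDP.
        let socket = UdpSocket::bind("127.0.0.1:0")?;
        Ok(StatsdClient {
            socket,
            target: target.to_string(),
        })
    }

    /// Sends a single gauge value as one UDP datagram.
    pub fn send_gauge(&self, name: &str, value: f32, tags: &str) -> io::Result<()> {
        let line = format_gauge(name, value, tags);
        self.socket.send_to(line.as_bytes(), self.target.as_str())?;
        Ok(())
    }
}
```

Because UDP is connectionless, a send succeeds even when nothing is listening, so a monitoring process like this keeps running if the agent is restarted.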

Kafka brokers report many more metrics, with JMX as the default reporter. So besides the Watchman metrics, we also send these JMX metrics to the standalone agent running on the server.

Using Monitoring Data With Graphs and Triggers

Once that data gets into AppSignal, you can build dashboards with it and set triggers on it, just like with everything else in AppSignal.

Here’s what that dashboard looks like on AppSignal:

In this one-hour resolution you can actually see Kafka's log rotation, which throws away old data every once in a while. Each sudden drop in the line marks one of these clean-ups.

We can then set up a trigger for when topic_retention_short is above 0 ms. That way, we are alerted when the timestamp indicates that retention is shorter than the specified time. This helps us make informed decisions about which topics to give more disk space, and shows what impact the message flow has on retention times.

Roundup

This concludes our little dive into using a standalone StatsD agent to get data monitored and how we dogfood AppSignal.

This is not the only way to get custom metrics into AppSignal, though. From your Ruby, Elixir, or Node.js app you can also add any metric you want. If you want to set this up yourself, read up on it in the documentation about custom metrics or the documentation about the standalone agent.

PS. Big thanks to Robert and Thijs. I typed the blogpost, but your brains wrote it 😉