Michael Levan

Multi-Cluster Kubernetes Monitoring And Observability

There’s a good chance that throughout your Kubernetes journey, you’ll have to manage multiple Kubernetes clusters. Whether they’re all production clusters, dev environments, or per-engineer clusters set up in a single-tenancy model, there will be multiple clusters.

Because of that, you’ll need a way to monitor them all in one place.

In this blog post, you’ll learn about the purpose of multi-cluster monitoring and a few tools/platforms that can help you implement it in production.

The Purpose

For this section, we’ll use Prometheus and Grafana as the example because it’s familiar to many engineers and one of the most popular stacks for monitoring and observability in Kubernetes.

Installing Prometheus and Grafana in Kubernetes is relatively straightforward (not easy, just straightforward).

You’ll see two primary methods:

  • The Prometheus Operator
  • Kube-Prometheus

The Prometheus Operator is like any other Kubernetes Operator. It gives you a declarative way to manage Prometheus-related resources in Kubernetes and continuously reconciles the current state toward the desired state.
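
As a quick illustration of that declarative approach, here’s a minimal ServiceMonitor resource. It’s only a sketch: the app name, namespace, port name, and the release label (which has to match your Prometheus serviceMonitorSelector) are all placeholders.

```bash
# Hypothetical example: tell the Prometheus Operator to scrape a Service
# labeled app=my-app on its "http-metrics" port every 30 seconds.
cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus   # placeholder; must match your Prometheus selector
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http-metrics
      interval: 30s
EOF
```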

Kube-Prometheus is almost like a “ready to go” installation of Prometheus and Grafana. It not only installs Prometheus and Grafana, but also best-practice dashboards and other components like Alertmanager.

With Helm, you can get either up and running in a matter of minutes.
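
For example, a typical Helm installation of the kube-prometheus-stack chart looks something like the sketch below (the release name and namespace are placeholders):

```bash
# Add the community chart repo and install Prometheus, Grafana, Alertmanager,
# and the Prometheus Operator into a dedicated "monitoring" namespace.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
```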

But here’s the problem - that’s for one cluster.

What if you have multiple clusters? Installing just the Operator or Kube-Prometheus doesn’t help much there. By default, you’re stuck installing Prometheus and Grafana on each cluster one by one, which leaves you with a separate instance of each to log into whenever you want to set up alerts or check your stack. That, of course, doesn’t scale: fifty (50) clusters means 50 instances of Prometheus and 50 instances of Grafana, which isn’t realistic for any highly-functioning engineering department.

You need a method to monitor and observe workloads, but do so in one place that has all of your Prometheus and Grafana configurations.

Now, of course, the above relates to any monitoring and observability platform. Remove Prometheus or Grafana and insert whatever other tool you like to use.

In the following sections, you’ll learn about three tools/platforms that you can use which make centralizing your configurations a bit more straightforward.

Thanos

The whole goal of Thanos is to give Prometheus long-term storage and the ability to collect metrics from multiple clusters in one place.

How does it work?

The previous section used per-cluster Prometheus and Grafana installations as the example. With Thanos, you still have to install Prometheus on every cluster. The difference is that the metrics from every cluster are exported to one location, which makes viewing them much easier (the downside is that you still have to manage multiple instances of Prometheus).

There are three primary components (that are decoupled) in Thanos:

  • Metric sources
  • Stores
  • Queries

Metric sources are the instances of Prometheus running on each cluster. Thanos runs a sidecar container alongside each Prometheus instance, and that sidecar takes the metric blocks Prometheus writes to disk (every two hours by default) and uploads them to one location: the object store. An object store is a persistent location where the metrics are stored. This can be, for example, an S3 bucket in AWS or an Azure Storage Account.
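
To make the mechanics a bit more concrete, here’s a rough sketch of the flags the sidecar runs with next to Prometheus. The paths and addresses are placeholders, and in practice the sidecar is usually added to the Prometheus Pod by the Prometheus Operator or a Helm chart rather than started by hand.

```bash
# The sidecar reads the blocks Prometheus writes to its TSDB directory
# and ships them to the object store defined in objstore.yml.
thanos sidecar \
  --tsdb.path /prometheus \
  --prometheus.url http://localhost:9090 \
  --objstore.config-file /etc/thanos/objstore.yml \
  --grpc-address 0.0.0.0:10901
```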

Stores are, as described above, the object storage where the metrics are saved. This could be anything from S3 to Azure Storage Accounts (you can find the full list in the Thanos documentation). The Store component continuously syncs against that storage and is the place from which you can query metrics from the various Prometheus installations on different clusters.
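
The object store itself is described in a small configuration file that every Thanos component pointing at the bucket shares. A minimal sketch for S3 (the bucket name, region, and credentials are placeholders; Azure, GCS, and the other providers use the same file with a different type and config keys):

```bash
# Hypothetical objstore.yml pointing Thanos at an S3 bucket.
cat <<'EOF' > objstore.yml
type: S3
config:
  bucket: "thanos-metrics"
  endpoint: "s3.us-east-1.amazonaws.com"
  region: "us-east-1"
  access_key: "<AWS access key>"
  secret_key: "<AWS secret key>"
EOF
```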

Queries, otherwise known as the Query Layer, are what you’d expect: they give you the ability to query the data that’s in the Store. The idea is resilient querying, so you don’t have to worry about a single node (where Prometheus is installed, which is the k8s cluster, but sometimes referred to as a node in the Thanos documentation) not being queryable.
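
A sketch of the Query component, assuming two clusters whose sidecars (plus a Store Gateway in front of the bucket) are reachable at the placeholder gRPC addresses below:

```bash
# Thanos Query fans out to every --store endpoint and exposes one
# Prometheus-compatible query API across all of them.
thanos query \
  --http-address 0.0.0.0:9090 \
  --store thanos-sidecar.cluster-1.example.com:10901 \
  --store thanos-sidecar.cluster-2.example.com:10901 \
  --store thanos-store-gateway.monitoring.svc:10901
```

Grafana can then point at the Query endpoint as a single Prometheus data source that covers every cluster.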

Please note that at this time Thanos is a CNCF incubating project. You can think of it as production-ready, but it’s still going through a ton of changes. Because of that, as with all incubating projects, continue with the understanding that the platform will most likely change as it’s developed.

The installation and configuration vary between clouds, so the best option is to point you at the official quick tutorial: https://thanos.io/v0.30/thanos/quick-tutorial.md/

Grafana Cloud

Grafana has a SaaS offering, Grafana Cloud, which has a paid model but also a free one (there’s a cap on the amount of metric data that can be ingested in the free version).

You can sign up for free by going to the following link: https://grafana.com/products/cloud/.

Once at the product page, click the blue Create free account button.

Once you sign up, choose the Monitor Kubernetes option on the GET STARTED page.

Next, you’ll see several options that are available for Kubernetes. Before you can use them, however, you need to get the Grafana Agent installed on a Kubernetes cluster so it can ship metrics to Grafana Cloud.

To install the agent, go to the Configuration tab.

Scroll down to the Configuration instructions and click the blue Agent configuration instructions button. There’s also an Operator available, but it’s in beta.

Go through the installation instructions that you see on your screen (no screenshot here because it shows sensitive information).
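
Without repeating the sensitive parts, the heart of what that configuration does is a Prometheus-style remote_write from the agent to your Grafana Cloud stack. A sketch of that stanza, with the URL, instance ID, and token all as placeholders (your real values come from the instructions page):

```bash
# Hypothetical remote_write fragment: metrics collected on the cluster
# are pushed to your Grafana Cloud Prometheus endpoint.
cat <<'EOF' > grafana-cloud-remote-write.yaml
remote_write:
  - url: https://prometheus-prod-01-prod-us-east-0.grafana.net/api/prom/push
    basic_auth:
      username: "123456"                       # your metrics instance ID
      password: "<Grafana Cloud access token>"
EOF
```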

Once complete, wait a minute or two, refresh your page, and click on the Kubernetes Monitoring option again.

You’ll now see that the monitoring and observability information is being consumed by Grafana.

Rinse and repeat for as many clusters as you have.

New Relic

There are a lot of “enterprise options” that are available, but in this section, you’ll dive into New Relic.

Keep in mind that what’s covered throughout this section applies to most (probably all) of the monitoring and observability tools in the same category as New Relic. Those include tools like:

  • Datadog
  • AppDynamics
  • Dynatrace

If you’re reading this and think to yourself “I have Datadog”, that’s fine. The concepts are still the same (managing monitoring and observability for multiple clusters in one location); it’s just a different tool.

As with all “enterprise tools”, there is a cost associated. However, New Relic has a free version.

Let’s dive into the pricing structure a bit. New Relic’s pricing is usage-based, and you can start with the Standard version, which has 100 GB of data included for free. If you want to test this out, or even have it running on your home lab, that’s more than enough data for free.

Now that you’ve looked at the pricing model, let’s dive into the installation.

First, sign up for New Relic for free here: https://newrelic.com/

Next, once you’re signed in, you’ll see a few options to install New Relic. Choose the Kubernetes option.

Next, enter your Kubernetes cluster name and click the Continue button.

Select any additional data you want to gather if the defaults don’t work for you and click the Continue button. Typically, the defaults are what you’ll want if this is your first installation.

There are three installation methods:

  • A guided installation
  • Helm
  • A Kubernetes Manifest

The “production-ready” option would be Helm, as that’s the easiest to manage for upgrades and removal of New Relic.
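
If you go the Helm route, the command New Relic generates looks roughly like the sketch below. The license key and cluster name are placeholders, and the guided installer may set additional chart values beyond what’s shown here.

```bash
# Install the New Relic Kubernetes integration bundle into its own namespace.
helm repo add newrelic https://helm-charts.newrelic.com
helm repo update

helm upgrade --install newrelic-bundle newrelic/nri-bundle \
  --namespace newrelic --create-namespace \
  --set global.licenseKey=<YOUR_NEW_RELIC_LICENSE_KEY> \
  --set global.cluster=<YOUR_CLUSTER_NAME> \
  --set kube-state-metrics.enabled=true
```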

Once chosen, run the code on your Kubernetes cluster and click the Continue button.

You’ll then see a progress screen indicating that the Kubernetes resources for New Relic are being deployed on the cluster.

If you run kubectl get all -n newrelic on the Kubernetes cluster, you’ll see a ton of Resources being deployed.

After waiting a few minutes, the installation will be confirmed. Click the Kubernetes cluster explorer button.

You can now start seeing monitoring and observability data for your cluster.

Rinse and repeat for any other Kubernetes clusters you have.
