Monitoring InnoDB Cluster members using Prometheus and MySQL Exporter

#mysql #database #devops #monitoring

The high-availability features of MySQL InnoDB Cluster in its open source (Community) edition have come a long way, and in recent years the 8.x versions have delivered immense improvements in this area over the veteran 5.7 versions. I've found monitoring such a cluster with OSS tools to be more involved, and through this post I hope to shed some light and share what has worked for me and my team.

TL;DR

8.x InnoDB Clusters (single Primary) are more tolerant of network failures (Auto Rejoin)
5.7.x InnoDB Clusters can also be managed using MySQL Shell from the 8.x series, although you will miss some of the newest management features (e.g. Clone)
Prometheus and MySQL Exporter will enable you to monitor cluster members as reported in the replication_group_members table.

1. DIY?

Let me start off with the obvious: you can get a high-availability database as a service in each of the public clouds - it will cost you, but it's probably well worth it. The upside to doing things on your own is that you'll learn a lot in the process about high availability, distributed replication and logs, monitoring and metrics, etc. Given a choice today I would go with a managed service, but it wasn't an option at the time. If you're here, I guess you're in the same boat - so hang on 😁

2. Prometheus

Prometheus is a remarkable open source project - it is incredibly powerful, full featured and offers rock solid stability. A Prometheus cluster coupled with its Alertmanager can keep track of your database clusters and alert your team without breaking a sweat. There are a few guides out there on setting up Prometheus and MySQL Exporter, so I will skip ahead and assume you are familiar and have got these up and running.

3. Group Replication

MySQL Group Replication was designed for a cluster of servers connected by a reliable, low-latency network. In my team's case, we required high availability of the cluster across several datacenters, and our network unfortunately failed quite often. Our 3-node, single-primary InnoDB Cluster based on MySQL 5.7 Group Replication would routinely kick out one of its members after it had been out of sync with the cluster for too long.

When manually rejoining the cluster fails, syncing the server's replication logs by hand is an especially harrowing experience. These occurrences were greatly reduced by upgrading our cluster to the 8.0.x series of releases, making use of the Auto Rejoin feature, and monitoring our cluster's members using MySQL Exporter.

4. Monitoring cluster members

The replication_group_members table reflects the node's view of the cluster, allowing the detection of a missing or failing node or a split-brain state. The 8.0 version is slightly more detailed, offering the node's version and role in the cluster.
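To make the idea concrete, here is a minimal Python sketch of the kind of check this table enables. It is not how MySQL Exporter works internally; the rows are hypothetical sample data shaped like the 8.0 version of the table (MEMBER_HOST, MEMBER_STATE, MEMBER_ROLE), and the function name check_view and the expected cluster size of 3 are assumptions made for illustration.

```python
# Hypothetical rows, shaped like performance_schema.replication_group_members
# on MySQL 8.0 (MEMBER_ROLE is not available on 5.7).
HEALTHY_VIEW = [
    {"MEMBER_HOST": "db1", "MEMBER_STATE": "ONLINE", "MEMBER_ROLE": "PRIMARY"},
    {"MEMBER_HOST": "db2", "MEMBER_STATE": "ONLINE", "MEMBER_ROLE": "SECONDARY"},
    {"MEMBER_HOST": "db3", "MEMBER_STATE": "ONLINE", "MEMBER_ROLE": "SECONDARY"},
]


def check_view(rows, expected_members=3):
    """Return a list of problems found in one node's view of the cluster."""
    problems = []

    # A member that is RECOVERING, UNREACHABLE, etc. does not count as healthy.
    online = [r for r in rows if r["MEMBER_STATE"] == "ONLINE"]
    if len(online) != expected_members:
        problems.append(f"{len(online)} of {expected_members} members ONLINE")

    # A single-primary cluster should have exactly one ONLINE PRIMARY.
    primaries = [r for r in online if r.get("MEMBER_ROLE") == "PRIMARY"]
    if len(primaries) != 1:
        problems.append(f"{len(primaries)} ONLINE PRIMARY members, expected 1")

    return problems
```

Calling check_view(HEALTHY_VIEW) returns an empty list. If the row for db1 is changed to UNREACHABLE, both checks fire, since the cluster is then down one member and has no ONLINE primary from this node's point of view.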

Starting with v0.13.0, my small contribution to MySQL Exporter allows it to export metrics for the node's view of the cluster:

Pull a release candidate image of mysqld-exporter using: docker pull prom/mysqld-exporter:v0.13.0-rc.0 (stable releases are very rare; I hope there will be one soon).
Configure your collectors. See the mysqld-exporter README.
Configure Prometheus alerts based on PromQL queries of the exported metrics.

Suggested PromQL queries

Number of ONLINE members:
count by (<tags identifying cluster members>) (mysql_perf_schema_replication_group_member{cluster_name="my_stable_cluster",group="infra",member_state="ONLINE"}) != 3

Missing PRIMARY:
count by (<tags identifying cluster members>) (mysql_perf_schema_replication_group_member{cluster_name="my_stable_cluster",group="infra",member_state="ONLINE",member_role="PRIMARY"}) < 1

Help My Metrics are Duplicated

As I mentioned above, the metrics exported from a cluster member reflect this member's view of the cluster. When each exporter in a 3-node cluster reports 3 metrics, you'll end up with... that's right, 9 metrics, most of them duplicates.

Reporting all these metrics makes it possible to detect a situation where two members have differing views of the cluster's state. InnoDB Cluster and its Group Communication System (GCS) seem to be pretty good at avoiding such cases, so to ignore the duplication of metrics, aggregate with a by clause, as the example queries above do with count by.
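The effect of the duplication, and of grouping it away, can be sketched in a few lines of Python. This only simulates the idea outside of Prometheus; the exporter addresses, host names and helper function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical samples: (exporter that was scraped, member described, state).
# Three exporters each report their own view of all three members.
SAMPLES = [
    (exporter, member, "ONLINE")
    for exporter in ("db1:9104", "db2:9104", "db3:9104")
    for member in ("db1", "db2", "db3")
]


def states_by_member(samples):
    """Group samples by the member they describe, ignoring who reported them."""
    states = defaultdict(set)
    for _exporter, member, state in samples:
        states[member].add(state)
    return dict(states)


def online_members(samples):
    """Members that every reporting exporter sees as ONLINE."""
    return sorted(
        m for m, s in states_by_member(samples).items() if s == {"ONLINE"}
    )


def disagreements(samples):
    """Members for which the exporters report more than one state."""
    return sorted(m for m, s in states_by_member(samples).items() if len(s) > 1)
```

With the sample data, the 9 samples collapse to 3 members, all ONLINE, and disagreements(SAMPLES) is empty. If one exporter reported db3 as RECOVERING while the other two still saw it as ONLINE, db3 would show up in disagreements, which is the differing-views case that the duplicated metrics make visible.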

5. The End

Hope you found this useful!
