Karan Pratap Singh

Posted on Apr 15, 2022

Advanced Monitoring with NATS Surveyor

#distributedsystems #monitoring

In this article, we’ll set up nats-surveyor for advanced monitoring of our NATS servers through Prometheus and Grafana.

What is NATS Surveyor?

NATS Surveyor polls the NATS servers for Statz messages and turns the responses into metrics for Prometheus. This allows a single exporter to connect to any NATS server and get a complete picture of a NATS deployment without requiring extra monitoring components or sidecars.

It’s really powerful because we can point Prometheus at this single exporter and then set up dashboards on observability platforms like Grafana.
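To make "data for Prometheus" a little more concrete: Prometheus scrapes metrics in a plain-text exposition format. The snippet below is a minimal sketch of reading that format. The metric and label names in SAMPLE are made-up placeholders, not Surveyor’s actual metric names, and the parser only handles the simple cases shown.

```python
import re

# Hypothetical exposition text. Metric and label names are placeholders
# chosen for illustration; they are NOT Surveyor's real metric names.
SAMPLE = """\
# HELP example_nats_sent_msgs Messages sent by a server (illustrative).
# TYPE example_nats_sent_msgs gauge
example_nats_sent_msgs{server_name="s1",cluster="east"} 1200
example_nats_sent_msgs{server_name="s2",cluster="west"} 800
example_nats_server_count 2
"""

# name, optional {labels}, value, optional integer timestamp
LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$')
LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue  # ignore anything this simple sketch cannot parse
        name, raw_labels, value = match.groups()
        labels = dict(LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def total(samples, name):
    """Sum one metric across all of its label combinations."""
    return sum(value for metric, _, value in samples if metric == name)


samples = parse_metrics(SAMPLE)
print(total(samples, "example_nats_sent_msgs"))
```

Because every sample carries labels such as the server or cluster it came from, one scrape target is enough to break the numbers down per server later in Grafana.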

Setup

Let’s set up our local super cluster and start our surveyor service.

Local cluster

To set up our local super cluster, we can use the nats-local-supercluster repo.

$ git clone https://github.com/ColinSullivan1/nats-local-supercluster.git
$ cd nats-local-supercluster
$ ./start_supercluster.sh

Surveyor

Now that our local super cluster is up and running, we can set up nats-surveyor.

For now, we’ll do it with docker and docker-compose.

Note: We can also install nats-surveyor directly from the GitHub releases.

$ git clone https://github.com/nats-io/nats-surveyor.git
$ cd nats-surveyor/docker-compose
$ ./survey.sh "nats://$(ipconfig getifaddr en0):4000" 9 ../../nats-local-supercluster/auth/nkeys/creds/myoperator/SYS/SYS.creds
[+] Running 3/0
 ⠿ Container nats-surveyor  Created                                                      0.0s
 ⠿ Container prometheus     Created                                                      0.0s
 ⠿ Container grafana        Created                                                      0.0s
Attaching to grafana, nats-surveyor, prometheus
...

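Notice how we use ipconfig getifaddr en0 (a macOS command) to get the machine’s current IP address: Surveyor runs inside a container, where 127.0.0.1 would refer to the container itself rather than the host. The second argument, 9, is the number of servers Surveyor should expect in the super cluster, and SYS.creds holds the system account credentials that Surveyor uses to request server statistics. The relative path to SYS.creds assumes that both repositories were cloned into the same parent directory.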
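Since ipconfig getifaddr en0 only works on macOS, here is a rough, portable sketch of the same idea in Python. The helper names local_ip and nats_url are hypothetical, and the result is a best-effort guess of the address used for outbound traffic, so double-check it on machines with several network interfaces.

```python
import socket


def local_ip(fallback="127.0.0.1"):
    """Best-effort guess of the IP address this machine uses for outbound traffic."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets. It only asks the OS to
        # pick a route, which also fixes the local address of the socket.
        # 192.0.2.1 is a reserved documentation address (TEST-NET-1).
        sock.connect(("192.0.2.1", 9))
        return sock.getsockname()[0]
    except OSError:
        return fallback  # no usable route, e.g. the machine is offline
    finally:
        sock.close()


def nats_url(host, port=4000):
    """Build the URL format passed to survey.sh above (port 4000 as in the article)."""
    return f"nats://{host}:{port}"


print(nats_url(local_ip()))
```

The printed value can be passed as the first argument to survey.sh in place of the ipconfig expression.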

Generating demo data

To generate traffic, we can use the nats bench command:

Note: Learn more about NATS CLI in the previous article.

$ nats bench -s 127.0.0.1:4000 --msgs 100000000 --pub 1 --sub 1 --creds ../../nats-local-supercluster/auth/nkeys/creds/myoperator/myaccount/myuser.creds subject
16:38:53 Starting pub/sub benchmark [subject=subject, msgs=100,000,000, msgsize=128 B, pubs=1, subs=1]
16:38:53 Starting subscriber, expecting 100,000,000 messages
16:38:53 Starting publisher, publishing 100,000,000 messages
Finished     40s [==========================================] 100%
Finished     40s [==========================================] 100%

NATS Pub/Sub stats: 4,924,665 msgs/sec ~ 601.16 MB/sec
 Pub stats: 2,462,354 msgs/sec ~ 300.58 MB/sec
 Sub stats: 2,462,346 msgs/sec ~ 300.58 MB/sec

Yes, we just transferred 100 million messages in about 40 seconds while running a super cluster on the same machine! NATS has amazing performance.
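As a quick sanity check, the figures in the benchmark output can be reproduced with a little arithmetic. This is only a sketch: it assumes the 128 B message size from the benchmark header and that "MB" in the output means 1024 * 1024 bytes, which is what the reported numbers are consistent with.

```python
MSG_SIZE_BYTES = 128   # msgsize from the benchmark header
MIB = 1024 * 1024      # assumption: "MB" in the output means 2**20 bytes


def throughput_mib(msgs_per_sec, msg_size=MSG_SIZE_BYTES):
    """Convert a message rate into MiB per second."""
    return msgs_per_sec * msg_size / MIB


pub = throughput_mib(2_462_354)        # publisher rate from the output
combined = throughput_mib(4_924_665)   # combined pub/sub rate from the output
avg_rate = 100_000_000 / 40            # messages divided by the rounded duration

print(f"pub:      {pub:.2f} MiB/sec")       # 300.58
print(f"combined: {combined:.2f} MiB/sec")  # 601.16
print(f"average:  {avg_rate:,.0f} msgs/sec")
```

The average of 2,500,000 msgs/sec is slightly above the reported 2,462,354 because the 40s duration in the progress bar is rounded.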

We can also use nats bench with the --pubsleep flag to simulate real-time traffic in the background while we look at the dashboards.

$ nats bench -s 127.0.0.1:4000 --msgs 100000000 --pubsleep 1ms --pub 1 --sub 1 --creds ../../nats-local-supercluster/auth/nkeys/creds/myoperator/myaccount/myuser.creds subject
14:24:20 Starting pub/sub benchmark [subject=subject, msgs=100,000,000, msgsize=128 B, pubs=1, subs=1, js=false, pubsleep=1ms, subsleep=0s]
14:24:20 Starting subscriber, expecting 100,000,000 messages
14:24:20 Starting publisher, publishing 100,000,000 messages
Receiving    18s [--------------------------------------------------------------]   0%
Publishing   18s [--------------------------------------------------------------]   0%
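The progress bars stay at 0% for a long time, and a back-of-the-envelope calculation shows why. The sketch below assumes that --pubsleep pauses once after every published message, so a single publisher can send at most one message per sleep interval.

```python
TOTAL_MSGS = 100_000_000   # --msgs value from the command above
PUBSLEEP_MS = 1            # --pubsleep 1ms


def max_publish_rate(pubsleep_ms):
    """Upper bound on msgs/sec for one publisher that sleeps after each message."""
    return 1000 / pubsleep_ms


def min_duration_hours(total_msgs, pubsleep_ms):
    """Lower bound on the run time, counting only the sleep intervals."""
    return total_msgs * pubsleep_ms / 1000 / 3600


rate = max_publish_rate(PUBSLEEP_MS)
hours = min_duration_hours(TOTAL_MSGS, PUBSLEEP_MS)
progress_after_18s = 18 * rate / TOTAL_MSGS * 100  # percent

print(f"at most {rate:,.0f} msgs/sec")
print(f"at least {hours:.1f} hours to finish")
print(f"about {progress_after_18s:.3f}% done after 18s")
```

In other words, this run is meant to be left in the background as a steady trickle of traffic rather than run to completion.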

Monitoring

Now we should be able to go to Grafana running on [localhost:3000/dashboards](http://localhost:3000/dashboards) and see all the available monitoring dashboards.

Note: You might be presented with a login screen. The default username is admin and the password is admin.

Here we can see we have different dashboards such as Clients, Clusters, NATS Overview, Network Usage, Super Cluster, etc. So let’s explore these dashboards one by one!

Clients

In the client dashboard, we can monitor things like slow consumers, subscriptions, connections per second, and much more.

Clusters

In the cluster dashboard, we can see how many clusters we are running, along with their bandwidth and messages per second.
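Panels that show "per second" values are typically derived from counters that only ever grow. The following is a simplified sketch of that calculation, assuming the panel is backed by such a cumulative counter. It is similar in spirit to PromQL’s rate() function, but it skips the extrapolation that Prometheus performs at the edges of the time window. The function name per_second_rate is hypothetical.

```python
def per_second_rate(samples):
    """Average per-second increase of a cumulative counter.

    samples is a list of (timestamp_seconds, counter_value), oldest first.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # The counter dropped, so treat it as a reset (e.g. a server
            # restart) and count everything since the reset.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Three scrapes, 15 seconds apart, of a message counter.
scrapes = [(0, 0), (15, 30_000), (30, 75_000)]
print(per_second_rate(scrapes))  # 2500.0
```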

Overview

The overview dashboard provides basic information, such as how many servers and clusters we are running and how many route and gateway connections they have.

Check out that insane 300k messages/sec, and that’s on a development machine!

Network Usage

The network dashboard is all about how much data is being sent or received in our clusters.

Node Resource Usage

This dashboard covers individual nodes and provides metrics like their CPU and memory usage.

Super Cluster

This dashboard works at the super cluster level and provides metrics like super cluster bandwidth, connections, message rate, and much more.

This makes it really easy to monitor multiple super clusters.
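Everything Grafana shows here comes from Prometheus, so the same data can also be queried directly through the Prometheus HTTP API. Below is a minimal sketch that builds an instant-query URL and parses a response. It assumes Prometheus is reachable at localhost:9090 (adjust this to match the docker-compose setup), and the metric name and sample response are placeholders, not real Surveyor output.

```python
import json
from urllib.parse import urlencode

PROMETHEUS = "http://localhost:9090"  # assumption: adjust to your setup


def instant_query_url(expr, base=PROMETHEUS):
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base}/api/v1/query?{urlencode({'query': expr})}"


def parse_vector(body):
    """Turn an instant-query JSON response into (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each result holds its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


# Hypothetical response for illustration; the metric name is a placeholder.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"cluster": "east"}, "value": [1650000000, "1200"]},
            {"metric": {"cluster": "west"}, "value": [1650000000, "800"]},
        ],
    },
})

print(instant_query_url("sum(example_metric)"))
print(parse_vector(SAMPLE_RESPONSE))
```

To run this against a live setup, fetch the URL with urllib.request.urlopen(url).read() and pass the result to parse_vector.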

Conclusion

In this article, we set up NATS Surveyor, an incredible tool that lets us set up monitoring for our NATS services with a single command. It’s a must-have if you’re running distributed systems with NATS at scale. Make sure to check out the docs for more info.

I hope this article was helpful. Feel free to reach out to me if you face any issues. Have a great day!