DEV Community

Sonia

The 8 Grafana panels every Cosmos validator dashboard should have (and most don't)

Most validator dashboards I look at show block height and missed blocks. That tells you the node is alive. It does not tell you whether you are about to get jailed. Here are the 8 panels that change that.

I sit at an unusual intersection. My day job is marketing for a DevOps and Web3 infrastructure team. That means I spend a lot of time on calls where engineering leads share their screens, walk me through their stack, and ask if our team can help with the bits that are breaking.

Validator operators are recurring guests on those calls. Cosmos, Solana, EVM, you name it. And after enough of those calls, a pattern shows up that I want to write down: most validator dashboards look the same, and most of them are missing the panels that actually matter.

This is not a tutorial. The PromQL is not in here. What I want to share is the pattern I keep seeing on those calls, and the reasoning behind the 8 panels that close the gap between "the node is up" and "the validator is healthy."

If you operate a Cosmos validator and the only graphs you check daily are block height and peer count, this post is for you.

The default Cosmos dashboard is a screenshot, not an operations tool

The most-imported Cosmos validator dashboards on Grafana Labs are very pretty. Big numbers, color-coded gauges, a graph of missed blocks rising over time. Operators import them on day one, take a screenshot for the company Notion page, and never look at them again until something breaks.

The problem is that "something is broken" arrives in a few different ways:

  • The validator is signing fine but inbound peers silently dropped to zero two hours ago and you have no idea.
  • The chain is producing blocks slowly because of a network upgrade, and your alerting is paging you for missed blocks that everyone is missing.
  • The disk is full of old WAL files. The validator is queueing writes. Block production is one minute away from breaking. Nothing in your dashboard hints at it.

Those are not exotic failure modes. They are the standard playbook. And the popular dashboards have no panels for any of them.

The mental shift: "is it up" vs "is it healthy"

Most of the dashboards I see are built to answer one question: is the node up. That is a useful question for the first ten minutes after spinning up a validator. After that, the question that actually matters is: is the validator healthy enough to keep signing under pressure.

Those are different questions. "Up" is a snapshot. "Healthy" is a trajectory. A dashboard that only shows current state is going to miss every leading indicator, and you will get woken up by the consequence instead of the cause.

The 8 panels below are the ones operators look at when an incident is in progress, not the ones that get included in investor decks.

The 8 panels, and why each one matters

1. Signing efficiency rate (rolling, not absolute)

Most dashboards show a counter of missed blocks since boot. That is close to useless. A validator that missed 400 blocks during a memory upgrade six months ago and has been perfect since is in a very different situation from a validator that has missed 80 blocks in the last hour. Same metric, opposite stories.

What you want is the ratio of signed blocks within the rolling signing window your chain uses to compute jailing. For Cosmos Hub that window is 10,000 blocks. The number sits at 1.0 when everything is fine and starts moving the second something goes wrong. That movement is your earliest signal.
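The arithmetic behind that ratio is small enough to sketch in a few lines of Python. This is an illustration, not a production exporter: the class name RollingSigningRatio is made up for this post, and the default window of 10,000 is the Cosmos Hub figure quoted above, so check your own chain's slashing parameters before reusing it.

```python
from collections import deque


class RollingSigningRatio:
    """Fraction of signed blocks over the last `window` blocks (hypothetical helper)."""

    def __init__(self, window: int = 10_000):
        self.window = window
        self._blocks = deque(maxlen=window)  # True = signed, False = missed
        self._signed_count = 0

    def record(self, signed: bool) -> None:
        # If the window is full, the oldest entry is about to be evicted.
        if len(self._blocks) == self.window and self._blocks[0]:
            self._signed_count -= 1
        self._blocks.append(signed)
        if signed:
            self._signed_count += 1

    def ratio(self) -> float:
        # With no data yet, report 1.0 rather than dividing by zero.
        if not self._blocks:
            return 1.0
        return self._signed_count / len(self._blocks)
```

Feed it one boolean per block height and plot ratio(). Old misses age out of the window on their own, which is exactly what the since-boot counter cannot do.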

2. Jailing prediction window

This is the panel I have never seen on a community dashboard, and it is the one I always recommend adding first. It answers a single question: at the current rate of missed blocks, how many minutes until the validator gets jailed.

When this panel is green you have hours of buffer. When it turns yellow you need to pay attention. When it turns red you stop whatever you are doing. It is also the panel that turns a stressful incident into a structured one. You stop staring at the missed-blocks counter trying to do math. The math is already on screen.
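A rough version of the math the panel does for you, as a Python sketch. The function name and arguments are hypothetical. The defaults assume a 10,000-block window and a 5% minimum-signed threshold, which is my understanding of Cosmos Hub's slashing parameters; read the real values from your chain. It also assumes the recent miss rate stays constant and ignores old misses sliding out of the window, so treat the result as a pessimistic estimate.

```python
def minutes_until_jail(
    missed_in_window: int,
    recent_missed: int,
    recent_blocks: int,
    block_time_s: float,
    window: int = 10_000,           # assumption: chain's signed-blocks window
    min_signed_ratio: float = 0.05,  # assumption: chain's min-signed-per-window
) -> float:
    """Estimate minutes until jailing at the current miss rate."""
    max_missed = window * (1 - min_signed_ratio)
    budget = max_missed - missed_in_window  # misses left before jailing
    if budget <= 0:
        return 0.0
    if recent_missed == 0:
        return float("inf")  # not missing anything right now
    # Blocks until the budget is spent at the recent miss rate.
    blocks_left = budget * recent_blocks / recent_missed
    return blocks_left * block_time_s / 60
```

For example, with 20 misses already in a 100-block window that tolerates 50, and 10 misses in the last 100 blocks at 6 seconds per block, the estimate comes out at about 30 minutes. Map that number to green, yellow and red thresholds that match your on-call response time.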

3. Block time deviation from network median

Every now and then a Cosmos chain has a slow patch. Maybe a validator with a big stake is being restarted. Maybe a network upgrade is staggering block production. Maybe the chain is just under load.

If you do not track network-wide block time, you cannot tell whether your missed blocks are your fault or the network's. And if you cannot tell that, you end up paging people for problems they cannot fix. This panel filters out the false alarms.
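One way to put a number on "the chain is slow for everyone" is to compare the median of the most recent block intervals against a longer baseline. The sketch below assumes you already have block header timestamps as seconds; the function names and the choice of 20 recent blocks are illustrative, not a standard.

```python
from statistics import median


def block_intervals(timestamps):
    """Seconds between consecutive block timestamps."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


def slowdown_factor(timestamps, recent: int = 20) -> float:
    """Median of the last `recent` intervals divided by the baseline median."""
    intervals = block_intervals(timestamps)
    if len(intervals) <= recent:
        raise ValueError("need more blocks than the recent window")
    baseline = median(intervals[:-recent])
    current = median(intervals[-recent:])
    return current / baseline
```

A factor near 1.0 while you are missing blocks points at your node. A factor well above 1.0 points at the network, and that is the case where paging someone achieves nothing.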

4. Peer count, split by direction

This is the panel that catches a failure mode I see more often than I should: outbound peers stable at 8, dashboard looks fine, but inbound peers dropped to zero two hours ago because of a sentry NAT change nobody documented. Your validator is producing blocks but invisible to the rest of the network. Eventually mempool depth grows, propose-block rounds start failing, and you get jailed without anything on the default dashboard flinching.

A single "peer count" number hides this completely. Splitting it into inbound and outbound takes one line of dashboard config and makes the failure mode visible.
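If you would rather derive the split yourself than rely on an exporter, a small Python sketch shows the idea. It assumes a parsed JSON "result" object from the node's net_info RPC endpoint in which each peer carries an is_outbound boolean; that field name reflects my understanding of the CometBFT response, so verify it against your node version. The function name is made up.

```python
def split_peers(net_info_result: dict) -> dict:
    """Count inbound and outbound peers from a parsed net_info result."""
    peers = net_info_result.get("peers", [])
    # Assumption: each peer entry has a boolean "is_outbound" field.
    outbound = sum(1 for peer in peers if peer.get("is_outbound"))
    return {"inbound": len(peers) - outbound, "outbound": outbound}
```

Export the two numbers as separate series. An alert on inbound staying at zero for more than a few minutes would have caught the NAT scenario above.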

5. Local RPC p99 latency

The validator's RPC port is part of an SLA whether you realise it or not. Your alerting hits it. Your block explorer hits it. Your monitoring systems hit it. When the validator process is under pressure, RPC latency spikes first, before missed blocks start showing up.

I think of this panel as the smoke alarm. By the time the missed-blocks counter is on fire, the RPC latency graph has been smoking for ten minutes. Catching it during the smoke phase is the difference between "wake up someone tomorrow" and "page on-call now."
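For anyone unsure what p99 means in practice, here is a nearest-rank percentile in Python. It is a sketch of the definition, assuming you collect latency samples (in any unit) from your own probe; Prometheus histograms compute this differently, by interpolating over buckets.

```python
def percentile(samples, pct: int) -> float:
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("no samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of pct/100 * n, avoiding floating point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

The point of p99 over an average is that one slow request in a hundred is invisible in the mean and obvious here, which is why this graph starts smoking first.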

6. Mempool depth and rejection rate

Two metrics on one graph. Mempool depth tells you whether transactions are flowing through the validator. The rejection rate tells you whether the validator is rejecting transactions because something is wrong (block size limits, sequence mismatches, recheck failures during a fork).

A flat low mempool when the chain is busy means your peer graph is broken. A growing mempool with a spike in rejections means you are about to fail propose-block rounds. Both are early warnings of jailing risk, and neither shows up if you only watch block height.
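Those two readings can be written down as a tiny classifier. This Python sketch is only meant to make the logic explicit: the function name, the labels and both thresholds (low_depth, high_rejection) are hypothetical and would need tuning per chain.

```python
def mempool_signal(depths, rejected, submitted, chain_busy,
                   low_depth=5, high_rejection=0.2):
    """Classify a window of mempool depth samples plus rejection counts."""
    if not depths:
        raise ValueError("need at least one depth sample")
    rejection_rate = rejected / submitted if submitted else 0.0
    flat_and_low = max(depths) <= low_depth
    growing = depths[-1] > depths[0]
    if chain_busy and flat_and_low:
        return "starved"      # busy chain, empty mempool: check the peer graph
    if growing and rejection_rate >= high_rejection:
        return "backing_up"   # depth climbing while rejections spike
    return "ok"
```

Both non-ok states deserve a look before they turn into missed blocks.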

7. Process saturation correlated with chain misses

This is the panel that closes the loop on root cause analysis. The default node-exporter dashboards show CPU, memory and disk in isolation. The chain dashboards show missed blocks in isolation. Neither one tells you whether the missed blocks are caused by IO wait, memory pressure or CPU saturation.

When I look at operators' setups, the dashboards that catch root cause fastest have these metrics overlaid on a single graph with a shared time axis. When IO wait climbs and missed blocks start incrementing in the same window, you have an IO-bound validator. That is not the kind of thing you guess at 3am. You see it.
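The overlay is for the eye. If you also want a number to go with it, a plain Pearson correlation between a resource series and the per-interval missed-block increments is one option. The sketch below is a generic textbook implementation in Python, not something I have seen in a specific operator's setup, and it assumes both series are sampled at the same timestamps.

```python
import math


def pearson(xs, ys) -> float:
    """Pearson correlation of two equally sampled series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)
```

A value near 1.0 between IO wait and missed-block increments supports the IO-bound reading. It is correlation, not proof, so it backs up the graph rather than replacing it.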

8. Sentry reachability from the validator

If you run a sentry architecture, this panel tells you whether your validator can actually reach the sentries that are supposed to protect it. Sentries that shield you from the public internet but are unreachable from your validator are worse than no sentries at all. You think you are isolated and protected. You are isolated and silent.

This is one of the panels you do not need until the day you really need it. A simple TCP probe to each sentry's P2P port, displayed as a row of green or red stat panels, takes 10 minutes to set up and has saved more validators than I can count.
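The probe itself is a few lines of Python with the standard library. A minimal sketch, assuming your sentries listen on port 26656 (the usual CometBFT P2P default; adjust if yours differ) and that you run it from the validator host. The function name is made up.

```python
import socket


def sentry_reachable(host: str, port: int = 26656, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

Run it once per sentry on a schedule and export 1 or 0 per host. Note that a successful TCP connect only proves the port is open, not that the P2P handshake works, so treat green as necessary rather than sufficient.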

What the dashboard is for

If I had to summarise the difference between the popular dashboards and the one I keep seeing on healthy validator setups, it comes down to this:

The popular dashboards tell you the node is up. The good ones tell you whether the validator is in trouble before the chain notices.

That is not a question of how many panels you have. It is a question of which questions the panels are designed to answer. The 8 above are the ones that turn a dashboard into something operators actually look at during an incident, instead of something that lives in a tab nobody clicks.

If you have been operating a validator long enough to have a panel that has saved you in production and is not on this list, I would genuinely like to know which one. Drop it in the comments. The next operator setting up a dashboard from scratch will benefit from it more than from another generic import.

Top comments (1)

Rahul Joshi

This is an essential guide for infrastructure reliability, providing a high-impact blueprint for monitoring the specific health metrics that actually prevent validator downtime. It’s a great example of "Observability with Intent," moving beyond generic stats to focus on the high-cardinality panels that are mission-critical for the Cosmos ecosystem.