Most Cosmos validator monitoring is one of two things: a check that the process is "up", or a 200-panel Grafana dashboard nobody looks at until after the incident. Neither pages you before you get jailed.
Quick disclosure: I am the co-founder who does the marketing at The Good Shell, not the one on call with the validators at 3am. But I sit next to the engineers who are, and their alerting config was living in a private repo doing nobody else any good. So I wrote it up. These are the 10 alert rules our team actually pages on for Cosmos Hub and Cosmos SDK validators. Real PromQL, production thresholds, nothing decorative. Everything you need is in this post: the rules, the scrape config, and the alert routing. Copy it, change the thresholds, ship it.
Before the rules: where the metrics come from
CometBFT exposes Prometheus metrics on :26660/metrics when you set instrumentation.prometheus = true in config.toml. Node-level metrics (disk, memory, clock) come from node_exporter.
One gotcha that breaks copy-pasted rules: the metric namespace. Modern CometBFT uses the cometbft_ prefix. Older chains, or any chain running instrumentation.namespace = "tendermint", expose the same metrics under tendermint_. Check yours before deploying:
curl -s localhost:26660/metrics | grep validator_missed_blocks
Examples below use cometbft_. Swap the prefix if your node uses tendermint_.
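If you run a mix of chains, one option is to normalize the prefix at scrape time instead of keeping two copies of every rule. A minimal sketch using Prometheus metric_relabel_configs, assuming a node that still exposes tendermint_ metrics; the job name and target are placeholders:

```yaml
scrape_configs:
  - job_name: cometbft
    static_configs:
      - targets: ["validator:26660"]   # placeholder host
    metric_relabel_configs:
      # Rewrite tendermint_<rest> to cometbft_<rest> before storage.
      # Relabel regexes are fully anchored, so cometbft_ metrics are untouched.
      - source_labels: [__name__]
        regex: "tendermint_(.+)"
        target_label: __name__
        replacement: "cometbft_${1}"
```

With this in place the rules can stay on the cometbft_ prefix for every node, at the cost of the stored metric names no longer matching what the node itself reports.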
1. Validator is missing blocks (jailing risk)
The single most important rule. Page on the rate first, because by the time you hit the absolute jail threshold it is often too late.
- alert: ValidatorMissingBlocks
  expr: increase(cometbft_consensus_validator_missed_blocks[5m]) > 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Validator missed {{ $value }} blocks in the last 5m"
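Cosmos Hub uses a 10,000 block window with min_signed_per_window = 0.05. That means you must sign at least 500 blocks in the window, so you are jailed once you have missed more than 9,500 of the last 10,000. Add a critical rule with a generous buffer, and tune the number to your chain's signed-blocks window: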
- alert: ValidatorJailImminent
  expr: cometbft_consensus_validator_missed_blocks > 400
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Approaching jail threshold: {{ $value }} missed blocks"
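As a worked example of tuning the critical threshold, here is a sketch for a hypothetical chain with signed_blocks_window = 20,000 and min_signed_per_window = 0.5. Both values are assumptions for illustration, so read the real ones from your chain's slashing params. In the SDK slashing module the jail point is the window minus the blocks you must sign, and the rule pages at 80% of that. The metric is a running count kept by the node, not the slashing module's own sliding-window counter, so treat the absolute number as an approximation.

```yaml
# Hypothetical chain parameters (assumptions, not Cosmos Hub values):
#   signed_blocks_window  = 20000
#   min_signed_per_window = 0.5
# must sign   = 20000 * 0.5   = 10000 blocks per window
# jail point  = 20000 - 10000 = 10000 missed blocks
# page at 80% = 10000 * 0.8   = 8000 missed blocks
- alert: ValidatorJailImminent
  expr: cometbft_consensus_validator_missed_blocks > 8000
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Approaching jail threshold: {{ $value }} missed blocks"
```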
2. Dropped from the active set or jailed
Voting power goes to zero when you are jailed or fall out of the active set. This is the "you are already out" alarm.
- alert: ValidatorNotInActiveSet
  expr: cometbft_consensus_validator_power == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Validator power is 0 (jailed or outside the active set)"
3. Block height is not advancing (node halted)
If height stops moving, the node is stuck: a failed upgrade, a corrupted DB, or a panic loop. This fires even when the process is technically "up".
- alert: BlockHeightStalled
  expr: increase(cometbft_consensus_height[3m]) == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "No new blocks in 3m, node is stuck"
4. In the set but not signing recent blocks
Distinct from missed_blocks: this catches a validator that is active but whose signer stopped producing signatures (a dead remote signer, a key issue). It compares the chain height to the last height you actually signed.
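- alert: ValidatorNotSigning
  # last_signed_height carries a validator_address label that height does not,
  # so that label is ignored to let the two series match
  expr: cometbft_consensus_height - ignoring(validator_address) cometbft_consensus_validator_last_signed_height > 5
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Last signed height is {{ $value }} blocks behind chain head"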
5. Low peer count
A validator behind sentries should always have peers. A collapsing peer count means a sentry is down or you are being partitioned, both of which lead to missed blocks.
- alert: LowPeerCount
  expr: cometbft_p2p_peers < 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Only {{ $value }} peers connected"
6. Block production is slowing down
Rising block intervals mean the network (or your node) is struggling to finalize. Useful as an early "something is wrong" signal before blocks are outright missed.
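- alert: SlowBlockProduction
  # block_interval_seconds is exported as a histogram in current CometBFT
  # releases, so the average is sum / count; if your node exposes a plain
  # gauge instead, use avg_over_time on the bare metric
  expr: |
    rate(cometbft_consensus_block_interval_seconds_sum[5m]) /
    rate(cometbft_consensus_block_interval_seconds_count[5m]) > 8
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Average block interval is {{ $value }}s over 5m"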
(Bonus: cometbft_consensus_rounds > 1 sustained tells you consensus is taking multiple rounds to commit, another stress signal.)
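Written as a rule, that bonus signal might look like the sketch below. The alert name, the 5m hold and the warning severity are assumptions for illustration rather than production values:

```yaml
- alert: ConsensusMultipleRounds
  # Sustained rounds above 1 means blocks are not committing on the early rounds
  expr: cometbft_consensus_rounds > 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Consensus has needed {{ $value }} rounds for 5m"
```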
7. Disk almost full
Chain data grows continuously. A validator that runs out of disk halts instantly. Alert with enough runway to prune or expand.
- alert: DiskSpaceCritical
  expr: |
    (node_filesystem_avail_bytes{mountpoint="/"} /
    node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Root filesystem at {{ $value | humanize }}% free"
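One way to make the runway explicit is to alert on the trend as well as the level. A sketch using PromQL's predict_linear, assuming a 6h lookback and a 24h horizon; both windows, the alert name and the severity are illustrative and worth tuning to how fast your chain data grows:

```yaml
- alert: DiskWillFillIn24h
  # Fit a line through the last 6h of free space and project it 24h ahead;
  # a negative projection means the disk is on course to fill.
  expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Root filesystem projected to fill within 24h at the current growth rate"
```

The trend rule catches a sudden growth spurt early, while the percentage rule above stays as the hard floor.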
8. Memory pressure (upgrade OOM risk)
Under normal load gaiad sits at 16 to 32GB. During coordinated upgrades it spikes, and an OOM kill at upgrade height is a classic jailing event. Catch the pressure before the kernel does.
- alert: HighMemoryPressure
  expr: |
    (node_memory_MemAvailable_bytes /
    node_memory_MemTotal_bytes) * 100 < 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Only {{ $value | humanize }}% memory available"
9. Remote signer is down (TMKMS or Horcrux)
If your signer dies, the node keeps running but cannot sign, and you march toward the jail threshold silently. On Cosmos Hub that runway is about 9,500 missed blocks (the 10,000 block window minus the 500 you must sign), which is hours rather than days at a block time of a few seconds. This assumes you scrape your signer host (a blackbox or port check works if TMKMS has no native exporter).
- alert: RemoteSignerDown
  expr: up{job="tmkms"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Remote signer target is down, validator cannot sign"
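If you go the blackbox route, a sketch of the scrape job could look like this. It assumes a blackbox_exporter reachable at blackbox:9115 with the tcp_connect module from its example config, and a signer that listens on a TCP port you can probe; signer:26659 is a placeholder. One detail matters: with blackbox, up only says the exporter answered, so the alert has to watch probe_success instead.

```yaml
# prometheus.yml: probe the signer through blackbox_exporter
scrape_configs:
  - job_name: tmkms
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets: ["signer:26659"]          # placeholder signer host:port
    relabel_configs:
      # Pass the signer address as the ?target= parameter of the probe
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the signer address as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the blackbox exporter
      - target_label: __address__
        replacement: "blackbox:9115"       # placeholder exporter address

# rules.yml: alert on the probe result, not on up
groups:
  - name: remote-signer
    rules:
      - alert: RemoteSignerProbeFailed
        expr: probe_success{job="tmkms"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "TCP probe to the remote signer is failing"
```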
10. Clock drift (NTP)
Underrated and brutal. With Proposer-Based Timestamps, a validator whose clock drifts past the chain's precision bound starts seeing valid proposals as "not timely" and prevotes nil, and its own proposals get rejected. The fix is monitoring the offset, not assuming chrony is fine. Needs the node_exporter timex collector.
- alert: ClockDrift
  expr: abs(node_timex_offset_seconds) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Clock offset is {{ $value }}s, consensus timing at risk"
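A possible companion rule covers the case where the clock is not being disciplined at all, which an offset check alone can miss. This sketch relies on node_timex_sync_status from the same timex collector (1 means synchronized, 0 means not); the alert name, hold time and severity are assumptions:

```yaml
- alert: ClockNotSynchronized
  # The kernel reports the clock as unsynchronized, e.g. chrony or ntpd stopped
  expr: node_timex_sync_status == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Kernel reports the clock is not synchronized to a time source"
```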
Wiring it up
Point Prometheus at the node and node_exporter:
scrape_configs:
  - job_name: cometbft
    static_configs:
      - targets: ["validator:26660"]
  - job_name: node
    static_configs:
      - targets: ["validator:9100"]
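Prometheus also has to be told where the rules live and where Alertmanager is. A minimal sketch of both pieces, assuming the rules file is called rules.yml, Alertmanager runs at alertmanager:9093 (a placeholder), and the group name is arbitrary:

```yaml
# prometheus.yml (alongside scrape_configs)
rule_files:
  - rules.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # placeholder host

# rules.yml: every "- alert:" entry from this post goes under rules:
groups:
  - name: cosmos-validator
    rules:
      - alert: ValidatorNotInActiveSet
        expr: cometbft_consensus_validator_power == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Validator power is 0 (jailed or outside the active set)"
```

Running promtool check rules rules.yml before reloading Prometheus catches indentation and PromQL syntax mistakes.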
Then route severity to where it belongs: critical to PagerDuty (wake someone up), warning to Slack. The point of splitting them is that you should be able to ignore Slack at 3am and still get paged for the things that actually jail you (the critical alerts in rules 1 to 4, 7, 8 and 9).
Take it
That is the whole baseline. Drop the ten rules into a rules.yml, then give Alertmanager a config like this one, with Slack as the default receiver and anything critical sent to PagerDuty:
route:
  receiver: slack
  group_by: [alertname]
  routes:
    - matchers: [severity="critical"]
      receiver: pagerduty
receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: <your-pagerduty-key>
  - name: slack
    slack_configs:
      - api_url: <your-slack-webhook>
        channel: "#validator-alerts"
Swap the thresholds for your chain's parameters and you have real alerting in an afternoon. This is the baseline our engineers run for Cosmos validators day to day. If it saves you one 3am page, it did its job. Better thresholds and war stories welcome in the comments.