Sonia

Posted on May 28

10 production-grade alert rules for Cosmos validators (with real PromQL)

#cosmos #prometheus #monitoring #sre

Most Cosmos validator monitoring is one of two things: a check that the process is "up", or a 200-panel Grafana dashboard nobody looks at until after the incident. Neither pages you before you get jailed.

Quick disclosure: I am the co-founder who does the marketing at The Good Shell, not the one on call with the validators at 3am. But I sit next to the engineers who are, and their alerting config was living in a private repo doing nobody else any good. So I wrote it up. These are the 10 alert rules our team actually pages on for Cosmos Hub and Cosmos SDK validators. Real PromQL, production thresholds, nothing decorative. Everything you need is in this post: the rules, the scrape config, and the alert routing. Copy it, change the thresholds, ship it.

Before the rules: where the metrics come from

CometBFT exposes Prometheus metrics on :26660/metrics when you set instrumentation.prometheus = true in config.toml. Node-level metrics (disk, memory, clock) come from node_exporter.

One gotcha that breaks copy-pasted rules: the metric namespace. Modern CometBFT uses the cometbft_ prefix. Older chains, or any chain running instrumentation.namespace = "tendermint", expose the same metrics under tendermint_. Check yours before deploying:

curl -s localhost:26660/metrics | grep validator_missed_blocks

Examples below use cometbft_. Swap the prefix if your node uses tendermint_.
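
If you would rather keep one set of rules across nodes on both namespaces, Prometheus can rename the series at scrape time. This is a minimal sketch rather than part of our team's config. It reuses the job name and target from the scrape config in this post and rewrites any tendermint_ metric name to its cometbft_ form before storage:

```yaml
scrape_configs:
  - job_name: cometbft
    static_configs:
      - targets: ["validator:26660"]
    metric_relabel_configs:
      # Rename tendermint_<anything> to cometbft_<anything> so rules written
      # against the cometbft_ prefix also match nodes on the old namespace.
      - source_labels: [__name__]
        regex: tendermint_(.*)
        target_label: __name__
        replacement: cometbft_${1}
```

Nodes that already use cometbft_ are unaffected, because the regex only matches names starting with tendermint_.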

1. Validator is missing blocks (jailing risk)

The single most important rule. Alert on the rate first, because by the time you hit the absolute jail threshold it is often too late.

- alert: ValidatorMissingBlocks
  expr: increase(cometbft_consensus_validator_missed_blocks[5m]) > 10
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "Validator missed {{ $value }} blocks in the last 5m"

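Cosmos Hub uses a 10,000 block window with min_signed_per_window = 0.05, which means you must sign at least 500 blocks in the window, so jailing comes after roughly 9,500 misses, not 500. Add a critical rule that fires well before that point, and tune the number to your chain's signed-blocks window: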

- alert: ValidatorJailImminent
  expr: cometbft_consensus_validator_missed_blocks > 400
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Approaching jail threshold: {{ $value }} missed blocks"
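
One caveat with an absolute threshold: the metric appears to count misses since the node process started, not misses inside the chain's sliding window, so check how it behaves across a restart on your node. If that holds for you, a windowed increase tracks the jailing condition more closely. A sketch, assuming a 10,000 block window at roughly 6s per block (about 16h); the rule name is hypothetical and the threshold is a placeholder:

```yaml
- alert: ValidatorMissedBlocksInWindow   # hypothetical rule name
  # Range ~= signed_blocks_window x block time (assumed: 10,000 blocks x ~6s ~= 16h).
  # 400 mirrors the rule above and is only a placeholder; tune it to your chain.
  expr: increase(cometbft_consensus_validator_missed_blocks[16h]) > 400
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "{{ $value | humanize }} blocks missed over roughly one signing window"
```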

2. Dropped from the active set or jailed

Voting power goes to zero when you are jailed or fall out of the active set. This is the "you are already out" alarm.

- alert: ValidatorNotInActiveSet
  expr: cometbft_consensus_validator_power == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Validator power is 0 (jailed or outside the active set)"

3. Block height is not advancing (node halted)

If height stops moving, the node is stuck: a failed upgrade, a corrupted DB, or a panic loop. This fires even when the process is technically "up".

- alert: BlockHeightStalled
  expr: increase(cometbft_consensus_height[3m]) == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "No new blocks in 3m, node is stuck"

4. In the set but not signing recent blocks

Distinct from missed_blocks: this catches a validator that is active but whose signer stopped producing signatures (a dead remote signer, a key issue). It compares the chain height to the last height you actually signed.

- alert: ValidatorNotSigning
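  # last_signed_height carries a validator_address label that height does not,
  # so ignore it or the two series never match and the rule stays silent.
  expr: cometbft_consensus_height - ignoring(validator_address) cometbft_consensus_validator_last_signed_height > 5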
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Last signed height is {{ $value }} blocks behind chain head"

5. Low peer count

A validator behind sentries should always have peers. A collapsing peer count means a sentry is down or you are being partitioned, both of which lead to missed blocks.

- alert: LowPeerCount
  expr: cometbft_p2p_peers < 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Only {{ $value }} peers connected"

6. Block production is slowing down

Rising block intervals mean the network (or your node) is struggling to finalize. Useful as an early "something is wrong" signal before blocks are outright missed.

- alert: SlowBlockProduction
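  # block_interval_seconds is exported as a histogram (look for _sum and _count
  # on your node), so compute the mean interval from those two series.
  expr: |
    rate(cometbft_consensus_block_interval_seconds_sum[5m]) /
    rate(cometbft_consensus_block_interval_seconds_count[5m]) > 8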
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Average block interval is {{ $value }}s over 5m"

(Bonus: cometbft_consensus_rounds > 1 sustained tells you consensus is taking multiple rounds to commit, another stress signal.)

7. Disk almost full

Chain data grows continuously. A validator that runs out of disk halts instantly. Alert with enough runway to prune or expand.

- alert: DiskSpaceCritical
  expr: |
    (node_filesystem_avail_bytes{mountpoint="/"} /
     node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Root filesystem at {{ $value | humanize }}% free"
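
A fixed percentage says nothing about how fast the disk is filling. If runway is what you care about, PromQL's predict_linear can project the trend forward. A sketch with a hypothetical rule name and placeholder windows (6h of history, 24h of lookahead):

```yaml
- alert: DiskWillFillSoon   # hypothetical rule name
  # Fit a line through the last 6h of free space and project it 24h ahead.
  # Fires when the projection drops below zero bytes available.
  expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600) < 0
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "Root filesystem projected to fill within 24h at the current growth rate"
```

If chain data lives on its own volume, point the mountpoint matcher at that volume instead of /.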

8. Memory pressure (upgrade OOM risk)

Under normal load gaiad sits at 16 to 32GB. During coordinated upgrades it spikes, and an OOM kill at upgrade height is a classic jailing event. Catch the pressure before the kernel does.

- alert: HighMemoryPressure
  expr: |
    (node_memory_MemAvailable_bytes /
     node_memory_MemTotal_bytes) * 100 < 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Only {{ $value | humanize }}% memory available"

9. Remote signer is down (TMKMS or Horcrux)

If your signer dies, the node keeps running but cannot sign, and you march toward the jail threshold silently. With Cosmos Hub's parameters (a 10,000 block window and min_signed_per_window = 0.05) that is about 9,500 missed blocks, or roughly 16 hours at a block time of around 6s. This assumes you scrape your signer host (a blackbox or port check works if TMKMS has no native exporter).

- alert: RemoteSignerDown
  expr: up{job="tmkms"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Remote signer target is down, validator cannot sign"
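
One caveat if you go the blackbox route: up then only tells you whether the blackbox exporter itself answered, and the result of the check lives in probe_success. A sketch, assuming a blackbox_exporter on 127.0.0.1:9115 with its stock tcp_connect module; the signer address, job name, and rule name are all hypothetical:

```yaml
# prometheus.yml: probe a TCP port on the signer through blackbox_exporter
scrape_configs:
  - job_name: signer_probe                  # hypothetical job name
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets: ["signer.internal:2222"]   # hypothetical signer host:port
    relabel_configs:
      # Pass the target to blackbox as ?target=..., keep it as the instance
      # label, then send the actual scrape to the exporter.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115
---
# rules.yml: alert on the probe result, not on up
- alert: RemoteSignerProbeFailed            # hypothetical rule name
  expr: probe_success{job="signer_probe"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "TCP probe to the remote signer is failing"
```

This only works if the signer actually listens on a port you can reach; a signer that dials out to the validator needs a different check.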

10. Clock drift (NTP)

Underrated and brutal. With Proposer-Based Timestamps, a validator whose clock drifts past the chain's precision bound starts seeing valid proposals as "not timely" and prevotes nil, and its own proposals get rejected. The fix is monitoring the offset, not assuming chrony is fine. Needs the node_exporter timex collector.

- alert: ClockDrift
  expr: abs(node_timex_offset_seconds) > 0.1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Clock offset is {{ $value }}s, consensus timing at risk"
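
The same timex collector also reports whether the kernel considers the clock synchronised at all, which catches a stopped or misconfigured time daemon before the offset grows. A small companion sketch with a hypothetical rule name:

```yaml
- alert: ClockNotSynchronised   # hypothetical rule name
  # node_timex_sync_status is 1 while the kernel reports the clock as
  # synchronised and 0 otherwise.
  expr: node_timex_sync_status == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Kernel reports the clock as unsynchronised"
```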

Wiring it up

Point Prometheus at the node and node_exporter:

scrape_configs:
  - job_name: cometbft
    static_configs:
      - targets: ["validator:26660"]
  - job_name: node
    static_configs:
      - targets: ["validator:9100"]
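
Prometheus also has to be told where the rule file is and where to send firing alerts. A sketch of the two extra top-level sections in prometheus.yml, assuming the rules are saved as rules.yml and Alertmanager is reachable at alertmanager:9093 (a placeholder address):

```yaml
# prometheus.yml, alongside scrape_configs
rule_files:
  - rules.yml                              # file holding the alert rules
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # assumed Alertmanager address
```

Running promtool check rules rules.yml before reloading catches YAML and PromQL mistakes early.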

Then route severity to where it belongs: critical to PagerDuty (wake someone up), warning to Slack. The point of splitting them is that you should be able to ignore Slack at 3am and still get paged for the things that actually jail you (rules 1 to 4, 7, 8, 9).
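
One formatting detail before deploying: the rules in this post are shown as bare list items, but Prometheus loads them from a file where they sit under a named group. A minimal wrapper, with a hypothetical group name and one rule filled in as an example:

```yaml
groups:
  - name: cosmos-validator        # hypothetical group name
    rules:
      - alert: ValidatorNotInActiveSet
        expr: cometbft_consensus_validator_power == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Validator power is 0 (jailed or outside the active set)"
      # ...the remaining rules follow at the same indentation
```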

Take it

That is the whole baseline. Drop the ten rules into a rules.yml, then give Alertmanager a route that splits on severity:

route:
  receiver: slack
  group_by: [alertname]
  routes:
    - matchers: [severity="critical"]
      receiver: pagerduty
receivers:
  - name: pagerduty
    pagerduty_configs:
      - service_key: <your-pagerduty-key>
  - name: slack
    slack_configs:
      - api_url: <your-slack-webhook>
        channel: "#validator-alerts"
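
When something critical is firing, the related warnings from the same host mostly add noise. Alertmanager's inhibit_rules can mute them for the duration. A sketch to add at the top level of the same file, assuming alerts from one host share an instance label:

```yaml
inhibit_rules:
  # While a critical alert is firing for an instance, suppress warning
  # notifications from that same instance.
  - source_matchers: [severity="critical"]
    target_matchers: [severity="warning"]
    equal: [instance]
```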

Swap the thresholds for your chain's parameters and you have real alerting in an afternoon. This is the baseline our engineers run for Cosmos validators day to day. If it saves you one 3am page, it did its job. Better thresholds and war stories welcome in the comments.

DEV Community