4 Cosmos validator mistakes that get you slashed at 3am

#cosmos #blockchain #infrastructure #devops

Cosmos validator slashing is almost entirely preventable. The operators who get slashed aren't usually victims of sophisticated attacks — they're running without one or more of the protection layers that professional validators treat as non-negotiable. Here are the four mistakes that show up most often.

Confusing double-sign with downtime: they are not the same thing.
Most validators know about slashing in the abstract. Fewer understand that the two slashing conditions have completely different consequences:

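Downtime: Miss more than 9,500 of the last 10,000 blocks (Cosmos Hub requires you to sign at least 5% of the window, which is 500 blocks) → 0.01% slash, 10-minute jail. You can unjail, rejoin the active set, and recover. Delegators will notice, but it's survivable. These are governance parameters, so confirm the current values with gaiad query slashing params.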

Double-signing: Sign two conflicting blocks at the same height → 5% slash, permanent jail (tombstoning). You cannot unjail after a double-sign. You and your delegators lose 5% of the bonded stake, and the validator is gone for good.
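To make the two penalties concrete, here is a minimal arithmetic sketch. It assumes the downtime rule is expressed the way the slashing module exposes it (a signed-blocks window plus a minimum signed fraction) and uses the figures quoted above as inputs. The function names and the example stake of 1,000,000 tokens are illustrative, not taken from any tool.

```shell
# Sketch only: the numbers passed in below are assumptions copied from this
# post. They are governance parameters, so query your own chain before
# relying on them.

# max_missed_blocks WINDOW MIN_SIGNED_PERCENT
# Prints how many blocks in the window can be missed before jailing.
max_missed_blocks() {
  local window=$1 min_signed_pct=$2
  echo $(( window - window * min_signed_pct / 100 ))
}

# slash_amount STAKE BASIS_POINTS
# Prints the tokens slashed from STAKE at a rate in basis points (1 bp = 0.01%).
slash_amount() {
  local stake=$1 bps=$2
  echo $(( stake * bps / 10000 ))
}

max_missed_blocks 10000 5     # prints 9500
slash_amount 1000000 1        # downtime at 0.01%: prints 100
slash_amount 1000000 500      # double-sign at 5%: prints 50000
```

On the same stake, the double-sign penalty is 500 times the downtime penalty, and only the downtime case lets you come back.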

The reason this distinction matters operationally: double-signing almost never happens from attacks. It happens when an operator runs a backup validator node without proper safeguards and both nodes come online simultaneously. The "I'll just spin up a second node as a failover" approach is exactly how you trigger a permanent 5% slash.

Using a backup node instead of TMKMS or Horcrux.
The correct answer to "what if my validator goes down?" is not a hot standby. It's key management.

TMKMS (Tendermint Key Management System) moves the signing key out of your validator node into a separate process, ideally on a separate machine or backed by an HSM. It tracks which blocks have been signed and refuses to sign conflicting blocks, which puts double-sign protection at the signing layer rather than the infrastructure layer. If someone compromises your validator host, they don't get the key.

Horcrux goes further: it splits your private key into shares using multi-party computation. You configure a threshold, say 2-of-3, so no single server holds the complete key. An attacker needs to compromise multiple servers simultaneously. And if one Horcrux node goes offline, the others still have quorum to sign, so you get high availability without the double-sign risk of running a hot standby.

The setup difference: TMKMS is a single process that protects the key. Horcrux is a distributed signer cluster that removes the single signer as a point of failure. For validators with significant stake, Horcrux is a common choice.
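The idea behind signer-side double-sign protection can be sketched as a "last signed" high-water mark: remember the highest (height, round, step) you have signed and refuse anything that is not strictly higher. The helper below is a hypothetical illustration of that idea, not TMKMS or Horcrux code. `may_sign` and `STATE_FILE` are made-up names, and real signers also handle cases this sketch ignores, such as re-sending an identical signature.

```shell
# Hypothetical sketch of a signing high-water mark. Not production code.
STATE_FILE="${STATE_FILE:-$(mktemp)}"

# may_sign HEIGHT ROUND STEP
# Succeeds and records the position only if (height, round, step) is strictly
# greater than the last recorded position. Otherwise it refuses (returns 1).
may_sign() {
  local h=$1 r=$2 s=$3 lh=0 lr=0 ls=0
  if [ -s "$STATE_FILE" ]; then
    read -r lh lr ls < "$STATE_FILE"
  fi
  if [ "$h" -gt "$lh" ] ||
     { [ "$h" -eq "$lh" ] && [ "$r" -gt "$lr" ]; } ||
     { [ "$h" -eq "$lh" ] && [ "$r" -eq "$lr" ] && [ "$s" -gt "$ls" ]; }; then
    echo "$h $r $s" > "$STATE_FILE"
    return 0
  fi
  return 1
}

may_sign 100 0 1 && echo "signed 100/0/1"
may_sign 100 0 1 || echo "refused repeat of 100/0/1"
may_sign 99 0 1  || echo "refused older height 99"
```

The point of the sketch is where the state lives: with one signer (or one threshold cluster) holding the watermark, a second node coming online cannot produce a conflicting signature, because it never holds a usable key of its own.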

Monitoring at the wrong threshold.
If your alert fires when you're jailed, it's too late.

The Cosmos Hub jails you the moment you cross the downtime threshold. Most people set their only alert at that threshold, or on jail status itself. By the time the alert fires, you're already jailed and the 0.01% slash has happened.

The right approach is two alerts:


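- alert: ValidatorMissedBlocks
  expr: increase(cosmos_validator_missed_blocks_total[10m]) > 10
  for: 2m
  labels:
    severity: warning
- alert: ValidatorJailRisk
  # Assumes the metric reports misses inside the current signing window.
  # If your exporter exposes a lifetime counter, wrap it in increase().
  expr: cosmos_validator_missed_blocks_total > 400
  for: 1m
  labels:
    severity: critical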

The warning gives you early signal. The critical fires at 400 missed blocks, well short of the jail threshold, while you still have time to intervene. The critical alert should go to PagerDuty, not just Slack. If it only lands in a Slack channel at 3am and nobody is awake to read it, you're jailed before anyone sees the message.
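To put a rough number on "time to intervene", the sketch below estimates the worst case: the node stays fully offline, so every new block counts as a miss. The inputs are assumptions (9,500 allowed misses, blocks about 6 seconds apart, an alert that fired at 400 misses), and `seconds_until_jail` is an illustrative name.

```shell
# seconds_until_jail MISSED MAX_MISSED BLOCK_SECONDS
# Worst-case estimate: assumes the node misses every block from now on.
seconds_until_jail() {
  local missed=$1 max_missed=$2 block_seconds=$3
  local remaining=$(( max_missed - missed ))
  if [ "$remaining" -lt 0 ]; then
    remaining=0
  fi
  echo $(( remaining * block_seconds ))
}

# Assumed figures, not chain queries: adjust for your network.
seconds_until_jail 400 9500 6    # prints 54600 (roughly 15 hours)
```

The budget shrinks quickly on chains with a smaller window or a stricter minimum, which is why the threshold belongs in the alert design rather than in someone's memory.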

Not using Cosmovisor for chain upgrades.
Chain upgrades cause a disproportionate share of slashing events. The validator misses the upgrade block, falls behind, and gets jailed for downtime. Or the operator runs the old binary past the upgrade height and ends up on the wrong fork.

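Cosmovisor solves this. It wraps your node binary, watches for the upgrade-info.json file that the upgrade module writes when an approved upgrade plan reaches its height, and swaps in the new binary at that height with no manual intervention. With DAEMON_ALLOW_DOWNLOAD_BINARIES=true it will also download the binary referenced in the upgrade plan. The Cosmovisor docs describe auto-download as intended for full nodes rather than validators, so many operators place a verified binary in the upgrades directory ahead of time instead.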

export DAEMON_NAME=gaiad
export DAEMON_HOME=$HOME/.gaia
export DAEMON_ALLOW_DOWNLOAD_BINARIES=true
export DAEMON_RESTART_AFTER_UPGRADE=true
cosmovisor run start

The alternative is manually monitoring governance, tracking upgrade heights, and being online at the exact moment the upgrade executes. In practice this means either a lot of alerting overhead or missing upgrades when the timing is inconvenient. Cosmovisor removes most of that risk, as long as the new binary is available to it before the upgrade height.
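If you prefer not to rely on auto-download, the new binary can be staged ahead of time in the directory layout Cosmovisor expects: $DAEMON_HOME/cosmovisor/upgrades/<upgrade-name>/bin/$DAEMON_NAME. The helper below is a minimal sketch of that step. `stage_upgrade` is a made-up name, and the upgrade name and binary path in the usage comment are placeholders.

```shell
# Defaults mirror the exports shown above; override them for your setup.
: "${DAEMON_NAME:=gaiad}"
: "${DAEMON_HOME:=$HOME/.gaia}"

# stage_upgrade UPGRADE_NAME PATH_TO_NEW_BINARY
# Copies the binary into the Cosmovisor upgrades layout and prints its path.
# UPGRADE_NAME must match the name in the on-chain upgrade plan.
stage_upgrade() {
  local name=$1 binary=$2
  local dest="$DAEMON_HOME/cosmovisor/upgrades/$name/bin"
  mkdir -p "$dest"
  cp "$binary" "$dest/$DAEMON_NAME"
  chmod +x "$dest/$DAEMON_NAME"
  echo "$dest/$DAEMON_NAME"
}

# Usage (placeholder values):
#   stage_upgrade <upgrade-name> ./build/gaiad
```

Verify the binary's checksum and version before staging it. Cosmovisor will run whatever sits in that directory when the upgrade height arrives.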

The layer most people skip: runbooks.
All the monitoring in the world doesn't help if the person who gets paged at 3am doesn't know what to do. The minimum runbook set for a Cosmos validator covers three scenarios: jailed for downtime, disk space critical, and sentry node offline. At 3am you don't want to be googling the unjail command or figuring out which log to check first.
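As a starting point, a "jailed for downtime" runbook can be as small as a helper that prints the steps in order so the on-call person reviews each command before running it. This is a hypothetical sketch: `runbook_jailed` is a made-up name, the key name and chain ID are placeholders, subcommand names can differ between SDK versions, and the unjail transaction will also need your usual gas and fee flags.

```shell
# runbook_jailed KEY_NAME CHAIN_ID
# Prints the steps instead of executing them. Nothing here touches the chain.
runbook_jailed() {
  local key=$1 chain_id=$2
  cat <<EOF
1. Confirm the node is synced (catching_up must be false):
   gaiad status
2. Check signing info and the jailed_until time:
   gaiad query slashing signing-info "\$(gaiad tendermint show-validator)"
3. After jailed_until has passed, unjail:
   gaiad tx slashing unjail --from $key --chain-id $chain_id
EOF
}

# Usage (placeholder values):
#   runbook_jailed my-validator-key cosmoshub-4
```

The order matters: unjailing a node that is still catching up just gets you jailed again.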

The full guide, including the complete TMKMS and Horcrux configurations, sentry node setup, and all seven protection layers, is at thegoodshell.com.

Happy to answer questions in the comments if you are working through any of these.