Lessons from operating a Cosmos validator: a year of slashing near-misses

#cosmos #web3 #devops #blockchain

I have been operating a Cosmos validator for just over a year. In that time, I have not been slashed once.

I have been close, multiple times.

Slashing is one of those things where the financial cost (5% of your bonded stake on a double-sign, plus permanent jailing) is dwarfed by the reputational damage. Delegators leave. You spend the next quarter rebuilding trust. The slash itself is a single moment but the consequence is months long.

So I have learned to pay attention to the near-misses. They are the actual lessons. The slashings are just the bills.

Here are the five near-misses from the last year that taught me what actually matters in validator operations.

Near-miss 1: the backup node that almost double-signed

Six months ago I was migrating my validator to a new bare-metal host. The plan was standard: bring up the new node, sync it, swap the priv_validator_key, shut down the old one.

I had done this twice before without issues. Confidence was high.

I prepared the new node, copied the consensus state, and was ready to do the cutover during a low-traffic window. The moment I started gaiad on the new node, my monitoring exploded with double-sign warnings.

What happened: the old node was still running. I thought I had shut it down. I had shut down the wrong process (there were two gaiad processes on that machine because of an earlier debugging session I had forgotten about). Both nodes were now signing.

What saved me: TMKMS. The signing service refused to sign the second block because it had already signed at that height. The validator did not actually double-sign on chain. It just looked like it was about to.

The lesson is not "be more careful with cutovers". The lesson is "your operational discipline will fail at some point, so build infrastructure that fails safe when it does". TMKMS with its state-tracking is exactly that. Without it I would have eaten a 5% slash for what was, fundamentally, a forgotten zombie process.

Near-miss 2: the upgrade I forgot was tonight

Three months ago I was traveling, sitting in an airport lounge in Madrid waiting for a delayed flight. I noticed at 11pm local time that my phone had eight Slack notifications about a Cosmos Hub governance proposal that was passing.

The upgrade was scheduled for 3am UTC. I was about to board a flight. I had no laptop access during the flight. I would land at 6am, an hour and a half before the upgrade height.

I texted my co-founder. He was asleep.

My fallback was Cosmovisor. The system watches for governance proposals, downloads the new binary when the upgrade height approaches, and swaps it at the right block height automatically. I had installed it months earlier and never actually had to rely on it for a real upgrade. Tonight was the night.

I landed at 6am, opened my laptop in the taxi, checked the validator. It was on the new version. Cosmovisor had handled the binary swap at the upgrade height, gaiad had restarted, the validator had rejoined the chain after a 12-second gap that did not even register as a missed block.

The lesson is that "I'll be around for the upgrade" is not an operational plan. You will not be around when the upgrade matters. Build infrastructure that does not need you.

Near-miss 3: the disk that filled up at 3am

Two months ago, on a Tuesday night, my disk usage alert went off at 80%. I was watching a movie. I made a mental note to clean it up in the morning.

At 3am, my phone started buzzing with critical alerts. Disk was at 99%. gaiad had crashed because it could not write to the chain database. The validator had been offline for eight minutes. I had missed roughly 45 blocks at that point, well below the 500-block jail threshold but heading in the wrong direction fast.

I got up, SSHed in, pruned old logs, freed up 40GB. Gaiad restarted. Validator caught up. No jail. But I had spent the next two hours wide awake, watching the recovery, knowing I had cut it close.

The lesson is that warning alerts are not "I'll handle it later" signals. They are "I will handle this in the next hour or my future self will pay for it". The infrastructure was correct. The monitoring was correct. The mistake was treating the alert as informational.

The change I made: warning alerts now also trigger a Slack DM to me with a 30-minute snooze. If I do not acknowledge in 30 minutes, the alert escalates to critical and pages me. The system does not trust me to remember.

Near-miss 4: the sentry that went silent

A month ago I noticed my validator was reporting only 1 peer for about 12 hours before I caught it. The sentry node architecture means the validator connects only through its sentries, never to public peers. I have two sentries in different cloud providers.

What happened: one sentry crashed silently (kernel panic, not graceful shutdown, no alert from the OS layer). The other sentry was up, but had lost most of its peers because of a separate networking issue with that cloud provider. The validator was effectively talking to one degraded sentry with maybe 4 working external peers.

If the second sentry had also gone down, the validator would have been completely partitioned. It would have kept signing whatever blocks it received from its single peer, which on a sufficiently large network partition could have meant signing a minority fork and then double-signing once the majority chain came back. Catastrophic.

The lesson is that uptime is not the right metric for sentry health. Peer count is. A sentry with 0 peers is functionally offline even if the process is running. I now monitor peer count per sentry and alert if any sentry drops below 8 peers for more than 5 minutes.

Near-miss 5: the runbook I had never actually read

Two weeks ago I got a downtime jail at 4am because of a network blip from my cloud provider. Not a near-miss in the slash sense, since I was already jailed, but a near-miss in the recovery sense.

I had a runbook for "validator jailed for downtime". I had written it six months earlier, never read it again. At 4am, half asleep, I opened it and realized two things:

The runbook referenced gaiad commands that had changed syntax in a chain upgrade three months ago. The unjail command in my runbook would have failed.
The runbook assumed I knew where the validator key was stored, but I had moved it to a different path two months ago and never updated the doc.

I fumbled the recovery for 20 minutes before I got the unjail transaction through. During that time my validator was out of the active set. Delegators saw it. I got two Slack messages from concerned delegators within the hour.

The lesson is that a runbook you have never read is fiction. Once a quarter, walk through every runbook end to end, in the actual environment, and update anything that has drifted. Treat documentation drift as a real operational risk, not a paperwork problem.

The meta-lesson

Looking back at these five near-misses, the pattern is clear. None of them were caused by some exotic attack. None of them required deep crypto knowledge to prevent. They were all operational: a forgotten process, a missed alert, an assumption about the environment, a doc that was not maintained.

Slashing prevention is mostly not about the protocol. It is about the discipline around the protocol. The teams that get slashed are not the ones who do not know about TMKMS. They are the ones who installed TMKMS once, never tested the failure modes, and assumed it was working.

The infrastructure layers (sentry nodes, TMKMS or Horcrux, monitoring, runbooks, Cosmovisor) are the necessary foundation. But what makes them actually work is the boring stuff: alerts you respond to, drills you run, docs you update.

I wrote a more detailed technical breakdown of the actual protections (sentry topology, TMKMS setup, Horcrux for threshold signing, monitoring rules, the works) over here if you want the implementation side: https://thegoodshell.com/cosmos-validator-slashing/

What near-misses have you had on your own validator? Always curious to compare notes.