DEV Community: Sonia

Lessons from operating a Cosmos validator: a year of slashing near-misses

Sonia — Wed, 20 May 2026 15:45:07 +0000

I have been operating a Cosmos validator for just over a year. In that time, I have never been slashed for a double-sign.

I have been close, multiple times.

Slashing is one of those things where the financial cost (5% of your bonded stake on a double-sign, plus permanent jailing) is dwarfed by the reputational damage. Delegators leave. You spend the next quarter rebuilding trust. The slash itself is a single moment but the consequence is months long.

So I have learned to pay attention to the near-misses. They are the actual lessons. The slashings are just the bills.

Here are the five near-misses from the last year that taught me what actually matters in validator operations.

Near-miss 1: the backup node that almost double-signed

Six months ago I was migrating my validator to a new bare-metal host. The plan was standard: bring up the new node, sync it, shut down the old one, then move the priv_validator_key across.

I had done this twice before without issues. Confidence was high.

I prepared the new node, copied the consensus state, and was ready to do the cutover during a low-traffic window. The moment I started gaiad on the new node, my monitoring exploded with double-sign warnings.

What happened: the old node was still running. I thought I had shut it down. I had shut down the wrong process (there were two gaiad processes on that machine because of an earlier debugging session I had forgotten about). Both nodes were now signing.

What saved me: TMKMS. The signing service refused to sign the second block because it had already signed at that height. The validator did not actually double-sign on chain. It just looked like it was about to.

The lesson is not "be more careful with cutovers". The lesson is "your operational discipline will fail at some point, so build infrastructure that fails safe when it does". TMKMS with its state-tracking is exactly that. Without it I would have eaten a 5% slash for what was, fundamentally, a forgotten zombie process.

Near-miss 2: the upgrade I forgot was tonight

Three months ago I was traveling, sitting in an airport lounge in Madrid waiting for a delayed flight. I noticed at 11pm local time that my phone had eight Slack notifications about a Cosmos Hub governance proposal that was passing.

The upgrade was scheduled for 3am UTC. I was about to board a flight. I had no laptop access during the flight. I would land at 6am, well after the upgrade height had passed.

I texted my co-founder. He was asleep.

My fallback was Cosmovisor. It wraps gaiad, watches for the upgrade plan the node writes to disk when it reaches the upgrade height, downloads the new binary if auto-download is enabled, and restarts the node on it automatically. I had installed it months earlier and never actually had to rely on it for a real upgrade. Tonight was the night.

I landed at 6am, opened my laptop in the taxi, checked the validator. It was on the new version. Cosmovisor had handled the binary swap at the upgrade height, gaiad had restarted, the validator had rejoined the chain after a 12-second gap that did not even register as a missed block.

The lesson is that "I'll be around for the upgrade" is not an operational plan. You will not be around when the upgrade matters. Build infrastructure that does not need you.

Near-miss 3: the disk that filled up at 3am

Two months ago, on a Tuesday night, my disk usage alert went off at 80%. I was watching a movie. I made a mental note to clean it up in the morning.

At 3am, my phone started buzzing with critical alerts. Disk was at 99%. gaiad had crashed because it could not write to the chain database. The validator had been offline for eight minutes. I had missed roughly 45 blocks at that point, well below the 500-block jail threshold but heading in the wrong direction fast.

I got up, SSHed in, pruned old logs, freed up 40GB. gaiad restarted. Validator caught up. No jail. But I spent the next two hours wide awake, watching the recovery, knowing I had cut it close.

The lesson is that warning alerts are not "I'll handle it later" signals. They are "I will handle this in the next hour or my future self will pay for it". The infrastructure was correct. The monitoring was correct. The mistake was treating the alert as informational.

The change I made: warning alerts now also trigger a Slack DM to me with a 30-minute snooze. If I do not acknowledge in 30 minutes, the alert escalates to critical and pages me. The system does not trust me to remember.
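The escalation rule is small enough to write down. This is a minimal sketch of the logic, not my actual alerting config: the function name and the three severity strings are made up for illustration, and only the 30-minute window comes from the setup described above.

```python
from datetime import datetime, timedelta
from typing import Optional

# The 30-minute acknowledgement window described in the post.
ACK_WINDOW = timedelta(minutes=30)


def alert_severity(fired_at: datetime,
                   acked_at: Optional[datetime],
                   now: datetime) -> str:
    """Return how a warning alert should be treated at time `now`.

    The severity names are hypothetical labels for this sketch.
    """
    if acked_at is not None:
        return "acknowledged"      # a human has seen it, stop escalating
    if now - fired_at > ACK_WINDOW:
        return "critical"          # nobody reacted in time: page
    return "warning"               # still inside the window: Slack DM only
```

The point of the design is that silence is treated as a failure, not as consent.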

Near-miss 4: the sentry that went silent

A month ago my validator had been reporting only 1 peer for about 12 hours before I caught it. The sentry node architecture means the validator connects only through its sentries, never to public peers. I have two sentries in different cloud providers.

What happened: one sentry crashed silently (kernel panic, not graceful shutdown, no alert from the OS layer). The other sentry was up, but had lost most of its peers because of a separate networking issue with that cloud provider. The validator was effectively talking to one degraded sentry with maybe 4 working external peers.

If the second sentry had also gone down, the validator would have been completely partitioned. It would have stopped receiving proposals, stopped signing, and started burning through its missed-block allowance toward a downtime jail, while every process on my dashboard still showed as running.

The lesson is that uptime is not the right metric for sentry health. Peer count is. A sentry with 0 peers is functionally offline even if the process is running. I now monitor peer count per sentry and alert if any sentry drops below 8 peers for more than 5 minutes.
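Here is a rough sketch of that rule, assuming you already collect timestamped peer counts for each sentry. The function and parameter names are illustrative; the 8-peer and 5-minute values are the ones I use.

```python
from typing import Iterable, Optional, Tuple

MIN_PEERS = 8          # threshold from the post
SUSTAIN_SECONDS = 300  # 5 minutes, from the post


def peers_low_too_long(samples: Iterable[Tuple[float, int]],
                       min_peers: int = MIN_PEERS,
                       sustain: float = SUSTAIN_SECONDS) -> bool:
    """samples: (unix_timestamp, peer_count) pairs, in time order, for one sentry.

    Returns True if the peer count stayed below `min_peers` for more than
    `sustain` seconds without recovering in between.
    """
    low_since: Optional[float] = None
    for ts, peers in samples:
        if peers < min_peers:
            if low_since is None:
                low_since = ts            # start of a low-peer stretch
            if ts - low_since > sustain:
                return True
        else:
            low_since = None              # recovered, reset the clock
    return False
```

In practice this lives in an alerting rule rather than a script, but the logic is the same: a brief dip is noise, a sustained one is an incident.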

Near-miss 5: the runbook I had never actually read

Two weeks ago I got a downtime jail at 4am because of a network blip from my cloud provider. Not a near-miss in the slash sense, since I was already jailed, but a near-miss in the recovery sense.

I had a runbook for "validator jailed for downtime". I had written it six months earlier and never read it again. At 4am, half asleep, I opened it and realized two things:

The runbook referenced gaiad commands that had changed syntax in a chain upgrade three months ago. The unjail command in my runbook would have failed.
The runbook assumed I knew where the validator key was stored, but I had moved it to a different path two months ago and never updated the doc.

I fumbled the recovery for 20 minutes before I got the unjail transaction through. During that time my validator was out of the active set. Delegators saw it. I got two Slack messages from concerned delegators within the hour.

The lesson is that a runbook you have never read is fiction. Once a quarter, walk through every runbook end to end, in the actual environment, and update anything that has drifted. Treat documentation drift as a real operational risk, not a paperwork problem.

The meta-lesson

Looking back at these five near-misses, the pattern is clear. None of them were caused by some exotic attack. None of them required deep crypto knowledge to prevent. They were all operational: a forgotten process, a missed alert, an assumption about the environment, a doc that was not maintained.

Slashing prevention is mostly not about the protocol. It is about the discipline around the protocol. The teams that get slashed are not the ones who do not know about TMKMS. They are the ones who installed TMKMS once, never tested the failure modes, and assumed it was working.

The infrastructure layers (sentry nodes, TMKMS or Horcrux, monitoring, runbooks, Cosmovisor) are the necessary foundation. But what makes them actually work is the boring stuff: alerts you respond to, drills you run, docs you update.

I wrote a more detailed technical breakdown of the actual protections (sentry topology, TMKMS setup, Horcrux for threshold signing, monitoring rules, the works) over here if you want the implementation side: https://thegoodshell.com/cosmos-validator-slashing/

What near-misses have you had on your own validator? Always curious to compare notes.

The 8 Grafana panels every Cosmos validator dashboard should have (and most don't)

Sonia — Wed, 13 May 2026 15:50:47 +0000

Most validator dashboards I look at show block height and missed blocks. That tells you the node is alive. It does not tell you whether you are about to get jailed. Here are the 8 panels that change that.

I sit at an unusual intersection. My day job is marketing for a DevOps and Web3 infrastructure team. That means I spend a lot of time on calls where engineering leads share their screens, walk me through their stack, and ask if our team can help with the bits that are breaking.

Validator operators are a recurring guest on those calls. Cosmos, Solana, EVM, you name it. And after enough of those calls, a pattern shows up that I want to write down: most validator dashboards look the same, and most of them are missing the panels that actually matter.

This is not a tutorial. The PromQL is not in here. What I want to share is the pattern I keep seeing on those calls, and the reasoning behind the 8 panels that close the gap between "the node is up" and "the validator is healthy."

If you operate a Cosmos validator and the only graphs you check daily are block height and peer count, this post is for you.

The default Cosmos dashboard is a screenshot, not an operations tool

The most-imported Cosmos validator dashboards on Grafana Labs are very pretty. Big numbers, color-coded gauges, a graph of missed blocks rising over time. Operators import them on day one, take a screenshot for the company Notion page, and never look at them again until something breaks.

The problem is that "something is broken" arrives a few different ways:

The validator is signing fine but inbound peers silently dropped to zero two hours ago and you have no idea.
The chain is producing blocks slowly because of a network upgrade, and your alerting is paging you for missed blocks that everyone is missing.
The disk is full of old WAL files. The validator is queueing writes. Block production is one minute away from breaking. Nothing in your dashboard hints at it.

Those are not exotic failure modes. They are the standard playbook. And the popular dashboards have no panels for any of them.

The mental shift: "is it up" vs "is it healthy"

Most of the dashboards I see are built to answer one question: is the node up. That is a useful question for the first ten minutes after spinning up a validator. After that, the question that actually matters is: is the validator healthy enough to keep signing under pressure.

Those are different questions. "Up" is a snapshot. "Healthy" is a trajectory. A dashboard that only shows current state is going to miss every leading indicator, and you will get woken up by the consequence instead of the cause.

The 8 panels below are the ones operators look at when an incident is in progress, not the ones that get included in investor decks.

The 8 panels, and why each one matters

1. Signing efficiency rate (rolling, not absolute)

Most dashboards show a counter of missed blocks since boot. That is close to useless. A validator that missed 400 blocks during a memory upgrade six months ago and has been perfect since is in a very different situation from a validator that has missed 80 blocks in the last hour. Same counter, opposite stories.

What you want is the ratio of signed blocks within the rolling signing window your chain uses to compute jailing. For Cosmos Hub that window is 10,000 blocks. The number sits at 1.0 when everything is fine and starts moving the second something goes wrong. That movement is your earliest signal.
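The post deliberately skips the PromQL, but the idea behind the panel fits in a few lines. A minimal sketch, assuming you can feed it one signed/missed flag per block; the class name is hypothetical and the 10,000-block default is the Cosmos Hub window mentioned above.

```python
from collections import deque


class SigningWindow:
    """Rolling ratio of signed blocks over the last `size` blocks."""

    def __init__(self, size: int = 10_000):
        self._blocks = deque(maxlen=size)  # True = signed, False = missed
        self._signed = 0

    def record(self, signed: bool) -> None:
        # When the window is full, the oldest block is about to be evicted.
        if len(self._blocks) == self._blocks.maxlen and self._blocks[0]:
            self._signed -= 1
        self._blocks.append(signed)
        if signed:
            self._signed += 1

    @property
    def missed(self) -> int:
        return len(self._blocks) - self._signed

    @property
    def ratio(self) -> float:
        # An empty window counts as healthy.
        return self._signed / len(self._blocks) if self._blocks else 1.0
```

Unlike a since-boot counter, old misses age out of this number on their own.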

2. Jailing prediction window

This is the panel I have never seen on a community dashboard, and it is the one I always recommend adding first. It answers a single question: at the current rate of missed blocks, how many minutes until the validator gets jailed.

When this panel is green you have hours of buffer. When it turns yellow you need to pay attention. When it turns red you stop whatever you are doing. It is also the panel that turns a stressful incident into a structured one. You stop staring at the missed-blocks counter trying to do math. The math is already on screen.
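The math behind the panel is a single division. This sketch uses the 500-block threshold quoted elsewhere in these posts as a default; treat that number as an assumption and substitute whatever your chain's slashing parameters give you. It also ignores old misses ageing out of the window, so it errs on the pessimistic side.

```python
import math


def minutes_until_jail(missed_in_window: int,
                       misses_per_minute: float,
                       jail_threshold: int = 500) -> float:
    """Estimate minutes until the missed-block count reaches the jail threshold.

    jail_threshold defaults to the figure used in the post (an assumption;
    derive the real value from your chain's slashing parameters).
    Returns math.inf when no blocks are currently being missed.
    """
    remaining = jail_threshold - missed_in_window
    if remaining <= 0:
        return 0.0                 # already at or past the threshold
    if misses_per_minute <= 0:
        return math.inf            # not missing anything right now
    return remaining / misses_per_minute
```

With 45 blocks already missed and 10 more going every minute, that comes out at 45.5 minutes, which is a very different conversation from "we have missed 45 blocks".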

3. Block time deviation from network median

Every now and then a Cosmos chain has a slow patch. Maybe a validator with a big stake is being restarted. Maybe a network upgrade is staggering block production. Maybe the chain is just under load.

If you do not track network-wide block time, you cannot tell whether your missed blocks are your fault or the network's. And if you cannot tell that, you end up paging people for problems they cannot fix. This panel filters out the false alarms.
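One way to express the check, as a sketch: compare the median of recent block intervals with a longer baseline. The function names and the 1.5x tolerance are illustrative choices, not values from any particular dashboard.

```python
from statistics import median
from typing import Sequence


def block_time_ratio(recent: Sequence[float], baseline: Sequence[float]) -> float:
    """Median recent block interval divided by the median baseline interval.

    Both inputs are block intervals in seconds. 1.0 means normal pace,
    2.0 means blocks are taking twice as long as usual.
    """
    return median(recent) / median(baseline)


def network_is_slow(recent: Sequence[float],
                    baseline: Sequence[float],
                    tolerance: float = 1.5) -> bool:
    """True when the whole chain is slow, so misses are probably not yours alone."""
    return block_time_ratio(recent, baseline) > tolerance
```

If this returns True while your missed blocks are climbing, look at the chain before you page anyone about your node.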

4. Peer count, split by direction

This is the panel that catches a failure mode I see more often than I should: outbound peers stable at 8, dashboard looks fine, but inbound peers dropped to zero two hours ago because of a sentry NAT change nobody documented. Your validator is producing blocks but invisible to the rest of the network. Eventually mempool depth grows, propose-block rounds start failing, and you get jailed without anything on the default dashboard flinching.

A single "peer count" number hides this completely. Splitting it into inbound and outbound takes one line of dashboard config and makes the failure mode visible.
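If your exporter does not expose the split, the node's own RPC can. As far as I know, the CometBFT /net_info endpoint lists each peer with an is_outbound flag; check that against your node version. This sketch takes the already-parsed JSON, so fetching it is left out, and the function name is made up.

```python
from typing import Any, Dict, Tuple


def split_peers(net_info: Dict[str, Any]) -> Tuple[int, int]:
    """Return (inbound, outbound) peer counts from a parsed /net_info response.

    Assumes each entry in "peers" carries a boolean "is_outbound" field.
    Accepts either the full JSON-RPC envelope or just its "result" object.
    """
    peers = net_info.get("result", net_info).get("peers", [])
    outbound = sum(1 for peer in peers if peer.get("is_outbound"))
    return len(peers) - outbound, outbound
```

Plot the two numbers as separate series and the "inbound went to zero" failure stops hiding behind a healthy total.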

5. Local RPC p99 latency

The validator's RPC port is part of an SLA whether you realise it or not. Your alerting hits it. Your block explorer hits it. Your monitoring systems hit it. When the validator process is under pressure, RPC latency spikes first, before missed blocks start showing up.

I think of this panel as the smoke alarm. By the time the missed-blocks counter is on fire, the RPC latency graph has been smoking for ten minutes. Catching it during the smoke phase is the difference between "wake up someone tomorrow" and "page on-call now."
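Why p99 and not the average: a handful of slow requests barely moves the mean but shows up immediately in the tail. A small sketch using the nearest-rank method (one of several percentile definitions; your metrics backend may interpolate differently).

```python
from typing import Sequence


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    # Integer ceiling of 0.99 * n, avoiding floating-point rounding.
    rank = -(-99 * len(ordered) // 100)
    return ordered[rank - 1]
```

With 98 requests at 10 ms and two at 900 ms and 950 ms, the mean is under 30 ms while this returns 900. That gap is the smoke.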

6. Mempool depth and rejection rate

Two metrics on one graph. Mempool depth tells you whether transactions are flowing through the validator. The rejection rate tells you whether the validator is rejecting transactions because something is wrong (block size limits, sequence mismatches, recheck failures during a fork).

A flat low mempool when the chain is busy means your peer graph is broken. A growing mempool with a spike in rejections means you are about to fail propose-block rounds. Both are early warnings of jailing risk, and neither shows up if you only watch block height.

7. Process saturation correlated with chain misses

This is the panel that closes the loop on root cause analysis. The default node-exporter dashboards show CPU, memory and disk in isolation. The chain dashboards show missed blocks in isolation. Neither one tells you whether the missed blocks are caused by IO wait, memory pressure or CPU saturation.

When I look at operators' setups, the dashboards that catch root cause fastest have these metrics overlaid on a single graph with a shared time axis. When IO wait climbs and missed blocks start incrementing in the same window, you have an IO-bound validator. That is not the kind of thing you guess at 3am. You see it.

8. Sentry reachability from the validator

If you run a sentry architecture, this panel tells you whether your validator can actually reach the sentries that are supposed to protect it. Sentries that shield you from the public internet but are unreachable from your validator are worse than no sentries at all. You think you are isolated and protected. You are isolated and silent.

This is one of the panels you do not need until the day you really need it. A simple TCP probe to each sentry's P2P port, displayed as a row of green or red stat panels, takes 10 minutes to set up and has saved more validators than I can count.
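The probe really is this small. A sketch, assuming the sentries listen on 26656, the default CometBFT P2P port; the function name and timeout are illustrative. Run it from the validator host, since reachability from anywhere else proves nothing.

```python
import socket


def sentry_reachable(host: str, port: int = 26656, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the sentry's P2P port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

A TCP connect only tells you the port is open, not that the sentry is relaying blocks, so treat it as a floor rather than a full health check.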

What the dashboard is for

If I had to summarise the difference between the popular dashboards and the one I keep seeing on healthy validator setups, it comes down to this:

The popular dashboards tell you the node is up. The good ones tell you whether the validator is in trouble before the chain notices.

That is not a question of how many panels you have. It is a question of which questions the panels are designed to answer. The 8 above are the ones that turn a dashboard into something operators actually look at during an incident, instead of something that lives in a tab nobody clicks.

If you have been operating a validator long enough to have a panel that has saved you in production and is not on this list, I would genuinely like to know which one. Drop it in the comments. The next operator setting up a dashboard from scratch will benefit from it more than from another generic import.

3 Ethereum validator decisions that look safe and aren't

Sonia — Sun, 10 May 2026 17:28:19 +0000

Most Ethereum validator incidents don't come from attacks. They come from configuration decisions that looked reasonable at setup and revealed their failure mode months later.
Three that come up repeatedly.

1. Running a 2,048 ETH consolidated validator on a single machine.

Pectra raised the maximum effective balance from 32 ETH to 2,048 ETH. Consolidating makes operational sense: fewer validator processes, simpler key management, auto-compounding. The risk that doesn't get modeled: a slashing event on a 2,048 ETH validator has proportionally larger consequences than the same event on a 32 ETH validator.
Running a 2,048 ETH validator on a single machine without DVT is not a reasonable risk posture. It's the same mistake as running a Cosmos validator in 2020 without TMKMS: technically possible, widely done, and wrong. The Ethereum Foundation staked 72,000 ETH using Dirk and Vouch across geographically distributed nodes. That's the reference implementation, not a niche setup.

2. Running Geth + Prysm because it's the most documented combination.

Geth is above 40% execution client market share. Prysm is above 40% consensus client share. Running both means that if a critical bug ships in either, you and thousands of other operators are exposed to the same correlated failure simultaneously. This is not a theoretical concern. It's the exact scenario that caused large-scale attestation failures before client diversity became a priority.
Running a minority client combination (Lighthouse + Nethermind, Teku + Besu) contributes to network resilience and protects your validator from correlated slashing events caused by a client-specific bug. The documentation is slightly thinner. The risk profile is significantly better.

3. Treating 32 GB RAM as sufficient after Fusaka.

Pre-Fusaka guides commonly listed 16-32 GB as the recommended spec. Fusaka activated PeerDAS in December 2025, changing how consensus clients handle blob data. By January 2026, blob parameters had reached 14/21 target/max. Running both execution and consensus clients on 32 GB under post-Fusaka blob load produces memory pressure during peak network activity that manifests as missed attestations.
64 GB is the practical floor for a production validator in 2026. Not the upper end of the recommended range, the minimum for stable operation. If you're running 32 GB today, check your consensus client memory headroom during peak blob propagation before the next BPO increase hits.

Platform engineering vs DevOps: the decision most growing startups get backwards

Sonia — Thu, 30 Apr 2026 11:30:35 +0000

Platform engineering is not a replacement for DevOps. It's what happens when DevOps works well enough that it creates a new problem.
Here's the sequence most teams miss.

DevOps solves the wall between dev and ops.

Developers own deployments. Everyone automates. Software ships faster. This works well up to 30-50 engineers. Every team manages their own infrastructure. It's messy but manageable.
Then scale kicks in. At 80-100 engineers, "everyone owns their infrastructure" means: 12 teams with 12 different CI/CD setups, 12 different Kubernetes patterns, 12 different approaches to secret management. A new engineer needs weeks to understand how deployments work. A security audit reveals inconsistency everywhere. Senior engineers spend 30% of their time answering other teams' infrastructure questions.

DevOps didn't fail. It created the conditions for a new problem.

Platform engineering solves that problem by building an Internal Developer Platform, a product whose users are your own developers. Instead of each team configuring Kubernetes from scratch, they click "Create New Service", fill a three-line form, and get a fully configured service with pipelines, monitoring, and compliance baked in.
The distinction that matters operationally:
DevOps: every developer owns their infrastructure
Platform engineering: every developer consumes infrastructure through self-service
The platform team doesn't answer tickets. They build the tooling that eliminates the tickets.

The signals that tell you platform engineering is necessary:

Setting up a new service takes more than a day. Your infrastructure team is answering requests rather than building. A security audit reveals inconsistent configurations across teams. Onboarding takes weeks because there are too many different setups to learn.

If none of those apply, DevOps is still the right answer for your stage. Platform engineering before the pain appears is overengineering. Platform engineering after the pain appears is recovery.

3 on-call rotation mistakes that burn out your best engineers first

Sonia — Wed, 29 Apr 2026 10:32:46 +0000

The engineers who leave over on-call are rarely the ones who complain about it. They're the ones who quietly absorb everything, resolve incidents fast, never escalate, and one day accept an offer somewhere else. By the time you notice the pattern, you've already lost the person the rotation was grinding down.
Three mistakes that create that outcome.

Measuring shifts per engineer instead of load per engineer. Equal shifts are not equal load. A week with two P1 incidents resolved in 20 minutes each is not the same as a week with twelve alerts that each require 45 minutes of investigation at 2am. If you track only who was on-call and not what that shift actually cost, you will consistently underestimate the burden on your senior engineers, who resolve things faster but get paged more often because they're trusted to handle anything. Track actionable pages per shift per engineer. If one person consistently receives 3x the load of others, the rotation is broken regardless of how the calendar looks. The fix is alert hygiene first (delete alerts nobody acts on for 30 consecutive days), then rebalance the schedule based on load data, not headcount fairness.
Putting engineers on independent on-call before shadow shifts. The correct progression before anyone carries the pager alone: observer phase (receive all the same pages, take no action, watch how the primary responds), then reverse shadow (lead the response with an experienced engineer watching), then independent. Skipping this costs you higher MTTR on every incident that engineer handles alone, plus an experience that makes on-call feel dangerous rather than manageable. Four to six weeks of partial senior engineer time upfront costs significantly less than the first major incident where an unprepared engineer makes it worse.
Treating on-call as part of the job with no additional recognition. An engineer paged three times outside business hours in a single week and expected to deliver full sprint capacity the following week is being asked to absorb a cost that isn't being acknowledged. This doesn't require complex compensation structures. Time in lieu for overnight pages, reduced sprint commitment after heavy on-call weeks, or explicit acknowledgment in performance reviews are all sufficient. The failure mode is pretending the cost doesn't exist.

If Opsgenie is still in your stack: end-of-support is April 5, 2027. If your runbooks and escalation policies live inside it, export everything now. The format doesn't migrate cleanly into alternatives.
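Going back to the first mistake, the load check is easy to automate once you have page counts. A rough sketch: the engineer names are made up, the 3x factor is the one from that paragraph, and using the median as the definition of "typical" load is my own assumption.

```python
from statistics import median
from typing import Dict, List


def overloaded(actionable_pages: Dict[str, int], factor: float = 3.0) -> List[str]:
    """Engineers whose actionable page count is at least `factor` times the median.

    actionable_pages maps each engineer in the rotation to the number of
    actionable pages they received over the period being reviewed.
    """
    if not actionable_pages:
        return []
    typical = median(actionable_pages.values())
    return sorted(name for name, count in actionable_pages.items()
                  if count > 0 and count >= factor * typical)
```

Feed it a quarter of data, for example {"ana": 4, "ben": 5, "chloe": 16, "dev": 3}, and it flags chloe even though the calendar says everyone took the same number of shifts.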

4 Cosmos validator mistakes that get you slashed at 3am

Sonia — Wed, 22 Apr 2026 10:46:48 +0000

Cosmos validator slashing is almost entirely preventable. The operators who get slashed aren't usually victims of sophisticated attacks — they're running without one or more of the protection layers that professional validators treat as non-negotiable. Here are the four mistakes that show up most often.

Confusing double-sign with downtime: they are not the same thing. Most validators know about slashing in the abstract. Fewer understand that the two slashing conditions have completely different consequences:

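Downtime: Miss more than 500 of the last 10,000 blocks → 0.01% slash, 10-minute jail. You can unjail, rejoin the active set, and recover. Delegators will notice, but it's survivable. The threshold, slash fraction and jail time are on-chain governance parameters, so confirm the live values with gaiad query slashing params before you build alerts around them.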

Double-signing: Sign two conflicting blocks at the same height → 5% slash, permanent jail. You cannot unjail after a double-sign. Your delegators lose 5% of their stake and you lose your validator permanently.

The reason this distinction matters operationally: double-signing almost never happens from attacks. It happens when an operator runs a backup validator node without proper safeguards and both nodes come online simultaneously. The "I'll just spin up a second node as a failover" approach is exactly how you trigger a permanent 5% slash.

Using a backup node instead of TMKMS or Horcrux. The correct answer to "what if my validator goes down?" is not a hot standby. It's key management.

TMKMS (Tendermint Key Management System) extracts the signing key from your validator node into a separate process. It tracks which blocks have been signed and refuses to sign conflicting blocks: double-sign protection at the signing layer, not the infrastructure layer. If someone compromises your validator host, they don't get the key.
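To make the state-tracking idea concrete, here is a toy version. This is not TMKMS's code, only an illustration of the principle: remember the last position you signed at and refuse anything that conflicts with it. The class name and the use of a plain string for the block ID are assumptions for the sketch, and a real signer persists this state to disk before releasing a signature.

```python
from typing import Optional, Tuple


class SignGuard:
    """Refuses to sign behind, or in conflict with, the last signed position."""

    def __init__(self) -> None:
        self._last: Optional[Tuple[int, int, int]] = None  # (height, round, step)
        self._last_block: Optional[str] = None

    def may_sign(self, height: int, round_: int, step: int, block_id: str) -> bool:
        position = (height, round_, step)
        if self._last is not None:
            if position < self._last:
                return False                         # going backwards: never sign
            if position == self._last:
                return block_id == self._last_block  # same payload only
        self._last, self._last_block = position, block_id
        return True
```

With this in front of the key, a second node that wakes up and asks for a signature on a different block at the same height simply gets a refusal.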

Horcrux goes further: it splits your private key into shares using multi-party computation. You configure a threshold, say 2-of-3, so no single server holds the complete key. An attacker needs to compromise multiple servers simultaneously. And if one Horcrux node goes offline, the others still have quorum to sign, so you get high availability without the double-sign risk of running a hot standby.

The setup difference: TMKMS is a single process that protects the key. Horcrux is a distributed cluster that eliminates the single point of failure entirely. For validators with significant stake, Horcrux is the standard.

Monitoring at the wrong threshold. If your alert fires when you're jailed, it's too late.

The Cosmos Hub jails you at 500 missed blocks out of 10,000. Most people set their alert at 500. By the time the alert fires, you're already jailed and the 0.01% slash has happened.

The right approach is two alerts:


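- alert: ValidatorMissedBlocks
  # Early signal: more than 10 new misses within 10 minutes.
  expr: increase(cosmos_validator_missed_blocks_total[10m]) > 10
  for: 2m
  labels:
    severity: warning
- alert: ValidatorJailRisk
  # Assumes this metric reflects misses inside the current signing window.
  # If your exporter only exposes a since-boot counter, compare an
  # increase() over the window's duration instead.
  expr: cosmos_validator_missed_blocks_total > 400
  for: 1m
  labels:
    severity: critical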

The warning gives you early signal. The critical fires at 400, which is 80% of the jail threshold, when you still have time to intervene. The critical alert should go to PagerDuty, not just Slack. If it only lands in a Slack channel at 3am and nobody wakes up, you're jailed before anyone sees the message.

Not using Cosmovisor for chain upgrades. Chain upgrades cause a disproportionate share of slashing events. The validator misses the upgrade block, falls behind, and gets jailed for downtime. Or the operator runs the old binary past the upgrade height and ends up on the wrong fork.

Cosmovisor solves this. It wraps the node process, watches for the upgrade plan the node writes to disk when it reaches the upgrade height, downloads the new binary if you allow it to, and swaps it automatically at the correct block height. No manual intervention required.

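# Name of the chain binary Cosmovisor supervises
export DAEMON_NAME=gaiad
# Node home directory (Cosmovisor keeps its binaries under $DAEMON_HOME/cosmovisor)
export DAEMON_HOME=$HOME/.gaia
# Fetch upgrade binaries automatically; pre-stage them instead if you prefer to vet each one
export DAEMON_ALLOW_DOWNLOAD_BINARIES=true
# Restart the node on the new binary once the upgrade is applied
export DAEMON_RESTART_AFTER_UPGRADE=true
# Run gaiad's start command under Cosmovisor
cosmovisor run start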

The alternative is manually monitoring governance, tracking upgrade heights, and being online at the exact moment the upgrade executes. In practice this means either a lot of alerting overhead or missing upgrades when the timing is inconvenient. Cosmovisor eliminates the category of risk entirely.

The layer most people skip: runbooks.
All the monitoring in the world doesn't help if the person who gets paged at 3am doesn't know what to do. The minimum runbook set for a Cosmos validator covers three scenarios: jailed for downtime, disk space critical, and sentry node offline. At 3am you don't want to be googling the unjail command or figuring out which log to check first.

The full guide, including the complete TMKMS and Horcrux configurations, sentry node setup, and all seven protection layers, is at thegoodshell.com.

Happy to answer questions in the comments if you are working through any of these.

SRE vs DevOps: the sequencing mistake that burns most startups.

Sonia — Mon, 20 Apr 2026 14:45:51 +0000

Most startups approach the SRE vs DevOps question wrong. They ask "which is better?" when the real question is "which do I need right now and in what order?"

After seeing this play out across a lot of engineering teams, the mistake is almost always the same: hiring the wrong role at the wrong stage. Here's what actually matters.

The one sentence that cuts through the noise.

A DevOps engineer makes it easier to ship software. An SRE makes sure that software stays running once it's shipped.

That's it. Every other difference (tooling, seniority, day-to-day work) follows from this. If your bottleneck is shipping, you have a DevOps problem. If your bottleneck is staying up, you have an SRE problem. The mistake is treating them as interchangeable or assuming you need both simultaneously from the start.

The sequencing trap most startups walk into.

This is the one that costs real money: hiring an SRE before a DevOps foundation exists.

An SRE without a functioning CI/CD pipeline is like hiring a Formula 1 engineer to fix a car that doesn't have wheels yet. The skills don't transfer down. An SRE wants to define SLOs, build error budgets, and design incident response processes. None of that is useful when your deployments still involve someone SSH-ing into a server and running a script manually.

The correct sequencing is almost always:

DevOps engineer to build the foundation: pipeline, IaC, basic monitoring.
SRE practices once you have production traffic and the foundation is stable.
Dedicated SRE hire when incident volume justifies it.

If you skip step one, you'll waste step two.

The specific signals that tell you which one you need.

"We have reliability problems" isn't specific enough. These are the actual triggers:

You need a DevOps engineer when:

Deployments involve manual steps or specific people who need to be online.
Onboarding a new engineer takes more than a day of environment setup.
Your cloud costs are growing without obvious cause (IaC discipline prevents sprawl).
Your CI/CD either doesn't exist or isn't trusted by the team.

You need an SRE when:

Your MTTR (mean time to recovery) is consistently above two hours.
You have users but no defined answer to "what's our acceptable downtime per month?".
Your monitoring produces alerts but no context; engineers get paged and their first action is "let me figure out where to look".
You're running validator nodes, RPC endpoints, or other infrastructure where availability is contractual or financial.

That last point is worth calling out. For Web3 infrastructure (validators, nodes, RPC endpoints), the tolerance for downtime is near-zero and the consequences of an incident are immediate and financial. SRE thinking is not optional there; it's the baseline.

What SREs actually bring that DevOps engineers don't.

The biggest conceptual gap between the roles is the error budget. An SRE defines an SLO (service level objective), say 99.9% availability, and then tracks how much of the resulting error budget (the 0.1% of allowed unavailability) has been consumed. When the budget is burned, they have the authority to stop feature shipping until reliability is restored.
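To put a number on that: a 99.9% availability SLO over a 30-day month leaves about 43 minutes of allowed downtime. A small sketch of the arithmetic, with illustrative function names and a 30-day period as the assumed default.

```python
def error_budget_minutes(slo: float, period_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the period."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return (1.0 - slo) * period_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, period_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once it is blown)."""
    budget = error_budget_minutes(slo, period_days)
    return (budget - downtime_minutes) / budget
```

Two 11-minute incidents in a month spend roughly half of a 99.9% budget. That is the kind of statement that turns "reliability vs. velocity" from an argument into a number.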

This is not a culture DevOps engineers typically build. A DevOps engineer optimises the delivery pipeline; they're not usually responsible for making the reliability vs. velocity tradeoff explicit. An SRE makes that tradeoff quantitative and enforced.

The practical consequence: a great SRE will tell you your product's reliability strategy is wrong. A great DevOps engineer will make your current strategy execute more smoothly. Both are valuable, but they're solving different problems.
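
The error budget itself is simple arithmetic. A minimal sketch, assuming a 30-day window and availability measured in minutes of downtime (both are my assumptions; teams define the window and the unit differently):

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window."""
    if not 0 < slo < 1:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the budget still unspent; zero or below means it is burned."""
    return 1 - downtime_minutes / error_budget_minutes(slo, window_days)

budget = error_budget_minutes(0.999)       # about 43.2 minutes per 30 days
remaining = budget_remaining(0.999, 21.6)  # about half the budget left
freeze_features = budget_remaining(0.999, 50.0) <= 0
```

At 99.9% over 30 days the budget is roughly 43 minutes, which is why a single long incident can justify a feature freeze.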

When one person can do both.

At an early stage, yes, and it's often the most efficient path. A senior engineer with both DevOps and SRE skills (sometimes called a Platform Engineer) can own the full stack: pipeline, monitoring, first SLOs, on-call rotation.

This person is expensive and not easy to find. But for a Series A startup with one infrastructure hire, this is the profile that gives you the most coverage without over-hiring into specialisation you don't need yet.

The roles diverge at scale. Platform teams own the tooling. SRE teams own reliability. That's a Series B+ problem.

The full breakdown also covers how this applies to outstaffing and what it looks like to bring in the right skills on a project basis.

Happy to answer questions in the comments if you are working through any of these.

5 GitHub Actions mistakes that will slow down (or break) your CI/CD pipeline.

Sonia — Sat, 18 Apr 2026 11:33:43 +0000

Most GitHub Actions tutorials get you to a green checkmark. Very few of them help you understand why your pipeline takes 8 minutes when it should take 2, or why your production deploy triggered from a feature branch PR at 11pm on a Friday.

After working with a lot of engineering teams setting up CI/CD from scratch, these are the patterns that come up again and again.

1. You're not caching dependencies and it's costing you minutes per run.

The single fastest win in any GitHub Actions pipeline is dependency caching. Most people skip it because the pipeline "works." It does work. It's just running npm install or pip install from scratch on every single run.

- name: Cache node modules
  uses: actions/cache@v4
  with:
    path: ~/.npm
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
    restore-keys: |
      ${{ runner.os }}-node-

The hashFiles key is the part that matters: the cache invalidates automatically when your lockfile changes, so you always get fresh deps when you actually update something. When it hits, the install step reads packages from the restored ~/.npm cache instead of downloading them again. On a mid-size Node project, this typically cuts 2–4 minutes per run.
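
To make the key and restore-keys behaviour concrete, here is a simplified model of the lookup. It is a sketch of the idea, not the implementation of actions/cache: the real hashFiles digest is computed differently, and the cache here is just a dict.

```python
import hashlib

def cache_key(runner_os: str, lockfile: bytes) -> str:
    """Key that changes whenever the lockfile content changes."""
    return f"{runner_os}-node-{hashlib.sha256(lockfile).hexdigest()}"

def lookup(cache: dict, key: str, restore_prefixes: list):
    """Exact key first, then each restore-keys prefix in order."""
    if key in cache:
        return "exact", cache[key]
    for prefix in restore_prefixes:
        matches = [k for k in cache if k.startswith(prefix)]
        if matches:
            # dicts keep insertion order, so the last match is the newest entry
            return "partial", cache[matches[-1]]
    return "miss", None

old_key = cache_key("Linux", b"lockfile contents v1")
new_key = cache_key("Linux", b"lockfile contents v2")
cache = {old_key: "deps-v1"}

same_lockfile = lookup(cache, old_key, ["Linux-node-"])     # exact hit
changed_lockfile = lookup(cache, new_key, ["Linux-node-"])  # falls back to the prefix
```

A partial hit still saves time: the install step starts from the older cache and only fetches what changed.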

2. You're pushing to Docker Hub when GHCR is sitting right there.

GitHub Container Registry (GHCR) is built into GitHub and works with the GITHUB_TOKEN that already exists in every workflow. No extra secrets, no separate account, no rate limiting surprises.

The catch that trips people up: you need to explicitly grant the packages: write permission in your job definition. Without it, the push will fail with a misleading auth error.

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write   # ← this line is required, not optional

Then authenticate like this:

- name: Log in to GHCR
  uses: docker/login-action@v3
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}

No secrets to rotate, no third-party dependency, and images are scoped to your repo automatically.

3. Your production deploy is one accidental push away from triggering.

If your workflow deploys to production on every push to main, that's fine until someone force-pushes a fix, a bot commits a version bump, or a merge goes sideways.

The pattern that solves this is a separate deploy job with an if condition and needs chaining:

deploy-production:
  runs-on: ubuntu-latest
  needs: [lint-and-test, build-and-push]
  if: github.ref == 'refs/heads/main' && github.event_name == 'push'
  environment: production

The environment: production line is the one most people miss. If you've configured environment protection rules in GitHub (Settings → Environments), this gates the deploy behind required reviewers or a manual approval. It's free on public repos; for private repos, which protection rules you get depends on your GitHub plan, so check before relying on it.

This means: automated deploys from main, but with a human checkpoint before anything touches production.
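
The gate on the deploy job can be read as a small truth table. The sketch below mirrors the condition in plain Python so the cases are easy to see; the function and argument names are mine, and it assumes the default behaviour where a job with needs only runs if those jobs succeeded.

```python
def should_deploy(ref: str, event_name: str, needs_results: dict) -> bool:
    """True only for a push to main after every upstream job succeeded."""
    upstream_ok = all(result == "success" for result in needs_results.values())
    return upstream_ok and ref == "refs/heads/main" and event_name == "push"

green = {"lint-and-test": "success", "build-and-push": "success"}

push_to_main = should_deploy("refs/heads/main", "push", green)
pull_request = should_deploy("refs/pull/42/merge", "pull_request", green)
feature_push = should_deploy("refs/heads/feature-x", "push", green)
failed_tests = should_deploy(
    "refs/heads/main", "push",
    {"lint-and-test": "failure", "build-and-push": "skipped"},
)
```

Only the first case deploys. The environment approval is a separate, later checkpoint that this condition does not model.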

4. You're putting everything in `secrets` when half of it should be in `vars`.

GitHub has two distinct places for pipeline configuration: Secrets (encrypted, write-only, for credentials) and Variables (plaintext, readable in UI, for config values).

Most teams put everything in secrets. That means your APP_ENV=production or LOG_LEVEL=info is encrypted and invisible in the GitHub UI, which makes debugging and auditing unnecessarily painful.

Variables are accessed with the vars context:

env:
  APP_ENV: ${{ vars.APP_ENV }}
  LOG_LEVEL: ${{ vars.LOG_LEVEL }}
  DATABASE_URL: ${{ secrets.DATABASE_URL }}  # this one actually needs to be a secret

Practical rule: if the value isn't a credential, a token, or a key, it belongs in vars.

5. You're pinning `actions/checkout@v4` but running on `ubuntu-latest`.

This is a subtle one. Pinning action versions (e.g., actions/checkout@v4) is good practice: it prevents upstream changes from breaking your pipeline without warning.

But then running runs-on: ubuntu-latest undoes some of that stability. ubuntu-latest is an alias that GitHub updates periodically (currently ubuntu-24.04, soon to rotate again), and those updates can change pre-installed tool versions, breaking pipelines that depend on system-level tools.

If stability matters more than getting the latest runner features:

runs-on: ubuntu-22.04   # pinned, not latest

You'll need to update it manually when the version reaches end-of-life, but you control when that happens, not GitHub's release schedule.

These are the patterns that separate an "it works" pipeline from one that's actually reliable in production. The full step-by-step guide covering the complete pipeline structure, including Kubernetes deploy jobs, multi-environment promotion workflows, and secrets management at scale, is at thegoodshell.com.

Happy to answer questions in the comments if you are working through any of these.

Beyond Meta Tags: The SRE’s Guide to Ranking in 2026

Sonia — Tue, 14 Apr 2026 09:08:54 +0000

We have been told for years that "Content is King." But in the high-stakes world of 2026, if your infrastructure is sluggish, your king is invisible.

Working at The Good Shell, I’ve spent the last few months analyzing a recurring pattern among high-growth SaaS and Web3 startups: they have world-class frontend talent and aggressive SEO targets, yet their organic growth is stagnant. After auditing several stacks, the diagnosis is almost always the same. It’s not the keywords. It's the "Technical Debt" living in the infrastructure.

If you are a developer or an SRE, this is why your infrastructure is the most powerful SEO tool you have.

1. The Death of the "Static" SEO Mindset

SEO used to be about what was on the page. Now, it’s about how that page is delivered. Google’s crawlers now operate with a strictly optimized "Crawl Budget."

If your server takes 800ms to respond because your K8s ingress is misconfigured or your database queries are unindexed, Googlebot will simply leave. It's not that your content isn't good; it's that Google cannot afford the computational cost of waiting for your server.
The takeaway: A slow TTFB (Time to First Byte) is an immediate ranking penalty.

2. The Hydration Trap in Modern Frameworks

We all love Next.js, Remix, and Nuxt. But "Hydration" is often where SEO goes to die.

When your infrastructure isn't tuned for Streaming SSR (Server-Side Rendering), the browser spends too much time executing JavaScript before the page becomes "Stable." This tanks your CLS (Cumulative Layout Shift) and LCP (Largest Contentful Paint).

At The Good Shell, we recently helped a client move logic from the heavy main server to the Edge. By utilizing Edge Middleware to handle geo-location and A/B testing instead of doing it at the origin, we dropped the LCP by 1.2 seconds. That change alone moved them from the second page of Google to the top 3 spots for their main keywords.

3. Scaling Infrastructure vs. Search Stability

One thing people rarely discuss is how infrastructure instability affects indexation.

Imagine Googlebot crawls your site during a deployment. If your CI/CD pipeline doesn't handle Zero-Downtime Deployments correctly, or if your health checks are too slow to pull a failing pod out of the rotation, the crawler hits a 5xx error.

To Google, a 5xx error isn't just a temporary glitch; it's a signal of unreliability. If it happens twice, your crawl frequency drops.

Pro-tip: Use tools like Prometheus and Grafana not just to monitor "Uptime," but to monitor "Crawl Health." If you see an increase in 4xx/5xx errors coinciding with your deployment windows, your SEO is bleeding.
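
A rough way to test the "errors coincide with deployment windows" idea before building a dashboard for it. This sketch assumes you have already filtered your access log down to crawler requests as (timestamp, status) pairs and that you know your deploy windows; both inputs are hypothetical.

```python
def error_rate(requests):
    """Share of requests that returned a 5xx status."""
    if not requests:
        return 0.0
    return sum(1 for _, status in requests if 500 <= status <= 599) / len(requests)

def split_by_deploy_window(requests, windows):
    """Partition requests into those inside any (start, end) window and the rest."""
    inside, outside = [], []
    for ts, status in requests:
        in_window = any(start <= ts < end for start, end in windows)
        (inside if in_window else outside).append((ts, status))
    return inside, outside

# Hypothetical crawler requests: (unix-style timestamp, HTTP status)
crawler_requests = [(50, 200), (120, 503), (150, 200), (180, 502), (250, 200), (300, 200)]
deploy_windows = [(100, 200)]

during, otherwise = split_by_deploy_window(crawler_requests, deploy_windows)
rate_during = error_rate(during)        # 2 of 3 requests failed
rate_otherwise = error_rate(otherwise)  # no failures
```

If the rate during deploys is consistently higher than the baseline, the deployment process is the thing to fix, not the content.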

4. The FinOps of SEO: Efficiency is a Feature

There is a direct correlation between resource efficiency and performance. An over-provisioned, messy Kubernetes cluster is often a slow one.

When we talk about FinOps (Cloud Cost Optimization), we aren't just saving money. We are removing the overhead that adds latency.

Over-instrumentation: Too many sidecars in your service mesh can add micro-latencies that aggregate.

Database Contention: Slow DB responses kill your TTFB.

By cleaning up the architecture, you aren't just lowering the AWS bill; you are giving Googlebot a "green light" to crawl more of your site, faster.

Conclusion: The Bridge

Technical SEO in 2026 is no longer about "tricking" a search engine. It’s about building a bridge between Marketing and SRE.

If you want to stay competitive:

Move logic to the Edge whenever possible.

Audit your TTFB with the same intensity you audit your code.

Bring SREs into the SEO conversation. Infrastructure isn't just a cost center; it's the foundation of your growth strategy. If the foundation is shaky, the skyscraper will never reach the clouds.

I’m curious—how many of you have seen a direct correlation between infrastructure upgrades and organic traffic? Let’s discuss in the comments.

Four things that will get your Cosmos validator slashed before you earn a single block reward

Sonia — Tue, 07 Apr 2026 16:04:00 +0000

The most dangerous moment in a Cosmos validator setup is not the on-chain registration. It is the ten minutes before it, when your priv_validator_key.json is sitting unprotected on the validator host and you are about to run create-validator for the first time.
Most guides walk you through the steps. Fewer of them tell you the specific things that will get you jailed or slashed if you skip them. These are four of them, from running validators on Cosmos Hub mainnet.

1. NVMe is not optional, it is the difference between signing blocks and missing them

Every guide lists "4TB SSD" as a hardware requirement. What most of them do not emphasize is that SATA SSDs and standard HDDs will cause I/O bottlenecks under load that manifest directly as missed blocks.
The chain data on Cosmos Hub has grown significantly. Under normal operation, the node is continuously reading and writing to disk. During governance-triggered upgrades, that load spikes. If your disk cannot keep up, the node falls behind on block processing and starts missing signatures.
NVMe specifically matters because the throughput difference between NVMe and SATA SSD is not marginal. It is the difference between a node that stays in sync under pressure and one that starts accumulating missed blocks at exactly the moment you can least afford it.
RAM is the second one people underestimate. You need 64GB. The 32GB setups work fine in normal operation. They fail during upgrades, when memory spikes well above the normal operating baseline. Running out of memory at upgrade height is a jailing event.

2. Never set DAEMON_ALLOW_DOWNLOAD_BINARIES=true in Cosmovisor

This feels counterintuitive. Cosmovisor's auto-download feature sounds useful: you stage the upgrade in governance, and Cosmovisor downloads and swaps the binary automatically at the right block height.
The problem is what happens when the download fails. If the binary cannot be fetched at upgrade height, the node halts immediately. You are now racing to manually place the binary before the jailing threshold kicks in. On Cosmos Hub, that window is approximately 500 blocks, around 16 minutes at normal block times, though the exact figure depends on the chain's current slashing parameters, so verify it before each upgrade.
The safer pattern is to always pre-place upgrade binaries manually in the Cosmovisor upgrade directory before the governance proposal passes. You monitor the proposal, you compile and verify the binary, you put it in place. Cosmovisor finds it already there and does the swap cleanly.
DAEMON_ALLOW_DOWNLOAD_BINARIES=false forces you into this pattern. It removes the failure mode where an auto-download kills your uptime at exactly the worst moment.
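
To see how a recovery window like that is derived, here is a minimal sketch of the arithmetic. The parameter names follow the Cosmos SDK slashing module, but the numbers are placeholders chosen to reproduce a 500-block window. They are not values read from Cosmos Hub, and the 2-second block time is likewise an assumption.

```python
def missed_block_budget(signed_blocks_window: int, min_signed_per_window: float) -> int:
    """Blocks a validator may miss inside the sliding window before jailing."""
    min_signed = round(signed_blocks_window * min_signed_per_window)
    return signed_blocks_window - min_signed

def recovery_minutes(missed_budget: int, block_time_seconds: float) -> float:
    """Wall-clock time a fully offline node has before the budget is spent."""
    return missed_budget * block_time_seconds / 60

# Placeholder inputs: read the real ones from your chain's slashing params.
budget = missed_block_budget(signed_blocks_window=1000, min_signed_per_window=0.5)
minutes_fast = recovery_minutes(budget, block_time_seconds=2.0)
minutes_slow = recovery_minutes(budget, block_time_seconds=6.0)
```

The same block budget gives very different wall-clock time depending on block time, which is the reason to recompute it from live parameters rather than remember a number.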

3. The migration double-sign window is where most slashing events happen

Double-sign slashing is permanent. It does not unjail. The tombstone is final.
The scenario that causes it most often is not a configuration mistake during initial setup. It is a validator migration: moving from one host to another. The sequence that causes it:
You believe the old node is stopped, so you start the new node. In reality the old process was never actually stopped, or it was restarted by a systemd restart policy, or a snapshot was used and the old node resumed from a state that did not reflect the stop.
Both nodes are now signing with the same key. Double-sign event. Tombstone.
The protection is simple but must be deliberate. When migrating: stop the old node, wait for a minimum of 10 confirmed blocks with no signing activity from that key, then start the new node. Never start the new node and then stop the old one. Never assume a stop command worked without verifying it.
Setting double_sign_check_height to a non-zero value in config.toml (10 to 20 blocks is standard) adds a second layer. On startup, the node checks that many recent blocks for signatures from its own key and refuses to join consensus if it finds any, because that means another instance is still signing.
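
The "wait for quiet blocks" rule can be written down as a check. This is a sketch of the idea only, not the node's own implementation: recent_signers is a hypothetical list with one set of signer addresses per block, oldest first, which you would have to build from the chain's commit data yourself.

```python
def safe_to_start(recent_signers, validator_address, check_height=10):
    """True only if the last `check_height` blocks carry no signature from our key."""
    if len(recent_signers) < check_height:
        return False  # not enough history observed yet, keep waiting
    window = recent_signers[-check_height:]
    return all(validator_address not in signers for signers in window)

ME = "cosmosvalcons1example"  # hypothetical consensus address
OTHERS = {"val-a", "val-b"}

still_signing = [OTHERS | {ME}] * 3 + [OTHERS] * 7  # signed 3 of the last 10 blocks
fully_quiet = [OTHERS | {ME}] * 5 + [OTHERS] * 10   # last 10 blocks are clean

blocked = safe_to_start(still_signing, ME)
allowed = safe_to_start(fully_quiet, ME)
```

Returning False when there is too little history is deliberate: in a migration, "unknown" should be treated as "not safe".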

4. The sentry architecture is what keeps your validator IP off the public internet

A validator without sentry nodes has its IP address visible in the P2P network. That is a DDoS target. An attack that keeps your validator offline long enough to fall below the chain's min_signed_per_window threshold triggers jailing on Cosmos Hub.
The sentry pattern is straightforward: two or more public-facing full nodes handle all external P2P connections. The validator node only connects to the sentries, never to the broader network. Its IP is never gossiped to peers.
On the validator node, this means pex = false and persistent_peers pointing only to the sentry node IDs. On the sentry nodes, the validator node ID is listed in private_peer_ids so its address is never shared with the network.
Run sentries in at least two different geographic regions and on different providers. A DDoS that takes down one sentry is neutralised if the second is on a separate network.
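
These settings are easy to get subtly wrong, so a small lint over the parsed config can help. A minimal sketch, assuming the [p2p] section of config.toml has already been loaded into a dict; the node IDs and hostnames below are made up.

```python
def _node_ids(csv: str) -> set:
    """Split a comma-separated value and keep the node ID before any '@'."""
    return {item.strip().split("@")[0] for item in csv.split(",") if item.strip()}

def validator_problems(p2p: dict, sentry_ids: set) -> list:
    """Checks for the validator's own [p2p] section."""
    problems = []
    if p2p.get("pex", True):
        problems.append("pex should be false on the validator")
    peers = _node_ids(p2p.get("persistent_peers", ""))
    if not peers:
        problems.append("persistent_peers is empty")
    elif not peers <= sentry_ids:
        problems.append("persistent_peers contains non-sentry nodes")
    return problems

def sentry_problems(p2p: dict, validator_id: str) -> list:
    """Checks for a sentry's [p2p] section."""
    if validator_id in _node_ids(p2p.get("private_peer_ids", "")):
        return []
    return ["validator node ID missing from private_peer_ids"]

SENTRIES = {"aaa111", "bbb222"}  # hypothetical sentry node IDs
good_validator = {"pex": False, "persistent_peers": "aaa111@sentry-1:26656,bbb222@sentry-2:26656"}
leaky_validator = {"pex": True, "persistent_peers": "aaa111@sentry-1:26656,ccc333@198.51.100.7:26656"}

ok = validator_problems(good_validator, SENTRIES)
leaks = validator_problems(leaky_validator, SENTRIES)
```

Running a check like this in CI against the rendered config catches the case where someone adds a "temporary" public peer to the validator.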

These four are the ones that cause the most production incidents on Cosmos validators: the hardware under-specification, the auto-download failure mode, the migration double-sign window, and the missing sentry layer. The rest of the setup, Go installation, gaiad build, state sync, TMKMS configuration, on-chain registration, is more mechanical.
If you want the full setup with all the configuration files and commands from start to production, I wrote a detailed guide covering the complete process:
Cosmos Validator Setup: The Ultimate Step-by-Step Guide for 2026
Happy to answer questions in the comments if you are working through any of these.

Bootnode Security: 6 Essential Hardening Layers to Protect Your Web3 Network

Sonia — Tue, 31 Mar 2026 08:38:05 +0000

If you run a blockchain network, whether private, permissioned, or public, you have at least one bootnode. Almost nobody has hardened it properly.
This is understandable. Bootnodes are infrastructure plumbing. They don't hold keys, they don't sign transactions. The assumption is that if a bootnode goes down, the network just loses peer discovery for a while. That assumption is wrong.
Here's what a compromised bootnode actually enables: eclipse attacks. An attacker who controls your bootnode can feed newly joining nodes a list of attacker-controlled peers. Those nodes then sync from attacker-controlled infrastructure. For a DeFi protocol or validator, this creates conditions for double-spend attacks, transaction censorship, and consensus manipulation.
A January 2026 paper on arXiv demonstrated the first practical end-to-end eclipse attack against post-Merge Ethereum execution layer nodes. This is not theoretical anymore.
This guide covers 6 hardening layers that every production bootnode needs.

The Real Threat Model

Before writing a single firewall rule, understand what you're actually defending against:
DDoS against the discovery port: bootnodes run UDP on port 30303 by default. UDP is stateless and easy to flood. A sustained attack takes down peer discovery for your entire network.
Enode key compromise: the enode private key is your bootnode's identity. If an attacker steals it, they can impersonate your bootnode indefinitely with a node your network trusts.
Eclipse attacks via discovery poisoning: attackers inject malicious nodes into a target's peer database using passive discovery behavior. A bootnode without rate limiting amplifies this attack.
Sybil attacks against the discovery table: bootnodes maintain a Kademlia-style table with 17 K-buckets, each holding up to 16 nodes. A Sybil attacker floods the table with controlled node IDs, crowding out legitimate peers. New nodes then get routed exclusively to attacker-controlled infrastructure.
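
To make the table numbers concrete, here is a simplified sketch of how a node ID maps to one of the 17 buckets by XOR distance. It is loosely modelled on the layout go-ethereum uses as I understand it. The constants other than 17 and 16 are my assumptions, and real clients hash node IDs first, which this sketch skips by using raw 256-bit integers.

```python
HASH_BITS = 256
N_BUCKETS = 17
BUCKET_SIZE = 16
BUCKET_MIN_DISTANCE = HASH_BITS - N_BUCKETS  # closer nodes all share bucket 0
TABLE_CAPACITY = N_BUCKETS * BUCKET_SIZE     # 272 entries in total

def log_distance(a: int, b: int) -> int:
    """Bit length of a XOR b: 0 for identical IDs, up to 256."""
    return (a ^ b).bit_length()

def bucket_index(local_id: int, remote_id: int) -> int:
    """Index 0..16 of the bucket that `remote_id` falls into."""
    d = log_distance(local_id, remote_id)
    if d <= BUCKET_MIN_DISTANCE:
        return 0
    return d - BUCKET_MIN_DISTANCE - 1

local = 0
farthest = bucket_index(local, 1 << 255)  # differs in the top bit
nearest = bucket_index(local, 1)          # differs only in the lowest bit
```

With only 272 slots in total, an attacker who can generate IDs for specific buckets does not need many nodes to crowd out honest peers, which is why rate limiting and allowlisting matter.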

Layer 1 - Host Hardening

Run nothing else on the bootnode host. Minimal attack surface is not optional.

# Disable unnecessary services
systemctl disable --now snapd cups avahi-daemon bluetooth

# SSH hardening: settings for /etc/ssh/sshd_config (not shell commands)
Port 22222
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers bootnode-admin
MaxAuthTries 3
X11Forwarding no
AllowTcpForwarding no

Store the enode key on an encrypted volume:

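cryptsetup luksFormat /dev/sdb                      # destroys any existing data on /dev/sdb
cryptsetup luksOpen /dev/sdb bootnode-keys
mkfs.ext4 /dev/mapper/bootnode-keys
mkdir -p /mnt/bootnode-keys                         # mount point must exist before mounting
mount /dev/mapper/bootnode-keys /mnt/bootnode-keys
chmod 700 /mnt/bootnode-keys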

Layer 2 - Network Hardening

This is where most bootnode security implementations fall apart. The default allows connections from any IP on any port. Fine for getting started. Not acceptable in production.

ufw default deny incoming
ufw default allow outgoing
ufw allow from <MANAGEMENT_IP> to any port 22222 proto tcp
ufw allow 30303/udp
ufw allow 30303/tcp
ufw enable

Rate limit UDP with iptables, because UFW alone doesn't rate-limit UDP:

# Insert at the top of INPUT so the limit is evaluated before UFW's accept rules
iptables -I INPUT -p udp --dport 30303 -m hashlimit \
  --hashlimit-name udp-discovery \
  --hashlimit-above 100/second \
  --hashlimit-burst 200 \
  --hashlimit-mode srcip \
  -j DROP
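
What the hashlimit rule does can be pictured as a token bucket per source IP: each sender may burst up to 200 packets, then is held to 100 per second. The class below is a simplified model of that behaviour for intuition, not a description of the kernel module's internals.

```python
class PerSourceLimiter:
    """Token bucket keyed by source IP."""

    def __init__(self, rate_per_second: float = 100, burst: int = 200):
        self.rate = rate_per_second
        self.burst = burst
        self.buckets = {}  # src_ip -> (tokens, time of last packet)

    def allow(self, src_ip: str, now: float) -> bool:
        tokens, last = self.buckets.get(src_ip, (float(self.burst), now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill
        if tokens >= 1:
            self.buckets[src_ip] = (tokens - 1, now)
            return True
        self.buckets[src_ip] = (tokens, now)
        return False  # over the limit: this packet would be dropped

limiter = PerSourceLimiter()
flood = [limiter.allow("203.0.113.9", now=0.0) for _ in range(250)]
accepted_in_flood = sum(flood)                           # the burst allowance
quiet_peer_ok = limiter.allow("198.51.100.20", now=0.0)  # other IPs are unaffected
after_one_second = sum(limiter.allow("203.0.113.9", now=1.0) for _ in range(150))
```

Because the limit is per source IP, it blunts a single noisy sender but not a distributed flood, which is why the allowlisting described next is the stronger control for private networks.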

For private/permissioned networks: restrict discovery to known IP ranges. There is no reason your bootnode should accept requests from arbitrary internet IPs.

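# Delete the blanket 'ufw allow 30303/udp' and 'ufw allow 30303/tcp' rules first; UFW applies the first match
ufw allow from <NODE_IP_RANGE>/24 to any port 30303
ufw deny 30303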

This single change is the most impactful improvement for private networks and almost nobody does it.

Layer 3 - Enode Key Management

Generate the key before starting the node. Never let the client auto-generate it.

# Generate and record the public key
bootnode -genkey /mnt/bootnode-keys/bootnode.key
bootnode -nodekey /mnt/bootnode-keys/bootnode.key -writeaddress

# Secure permissions
chmod 400 /mnt/bootnode-keys/bootnode.key
chown bootnode-service:bootnode-service /mnt/bootnode-keys/bootnode.key

Systemd with sandboxing:

# /etc/systemd/system/bootnode.service
[Service]
User=bootnode-service
ExecStart=/usr/local/bin/bootnode \
  -nodekey /mnt/bootnode-keys/bootnode.key \
  -addr :30303
Restart=always
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ReadWritePaths=/mnt/bootnode-keys

Back up the key to offline storage immediately. The offline backup must be tested, not just created.

Layer 4 - Eclipse Attack Prevention

Run at least 3 geographically distributed bootnodes across different cloud providers. An attacker needs to compromise all three simultaneously to control peer discovery.

# Each node points to all bootnodes
geth --bootnodes \
  "enode://<pubkey1>@<ip1>:30303,enode://<pubkey2>@<ip2>:30303,enode://<pubkey3>@<ip3>:30303"

Each bootnode lists the others for faster discovery and resilience.
Enable ENR/Discv5 where supported; it includes cryptographic verification that makes node impersonation significantly harder than with legacy enode URLs.

Layer 5 - Monitoring and Alerting

# Prometheus alerting rules
groups:
  - name: bootnode.security
    rules:
      - alert: BootnodeDown
        expr: up{job="bootnode"} == 0
        for: 2m
        labels:
          severity: critical

      - alert: BootnodePeerCountDrop
        expr: p2p_peers < 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count: possible eclipse or DDoS"

      - alert: BootnodeUDPFlood
        expr: rate(net_p2p_ingress_bytes_total[1m]) > 50000000
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Possible DDoS on discovery port"

Layer 6 - Disaster Recovery and Key Rotation

If the bootnode key is compromised, you need a pre-defined rotation procedure. Test it before you need it.

# Generate new key on new instance
bootnode -genkey /mnt/keys/bootnode-new.key
bootnode -nodekey /mnt/keys/bootnode-new.key -writeaddress > new-enode-pubkey.txt

# Push new enode to all network nodes via Ansible
# Bring up new bootnode
systemctl start bootnode-new

# After confirming the new node is healthy, take down the compromised node
systemctl stop bootnode-old

Multi-region deployment is non-negotiable for production:

Region 1 (AWS eu-west-1): Elastic IP
Region 2 (Hetzner Helsinki): static IP
Region 3 (GCP us-east1): static IP

Different providers means a cloud-level outage doesn't take down your entire discovery layer.

The Quick Checklist

Before deploying any production bootnode:
Host: dedicated host, SSH on non-standard port, key-only auth, disk encryption for keys, systemd sandboxing.
Network: UFW default deny, UDP rate limiting, SSH restricted to management IP, IP allowlisting for private networks.
Enode key: generated pre-start, encrypted volume, 400 permissions, offline backup tested, rotation runbook documented.
Architecture: minimum 3 bootnodes, cross-region, cross-provider, cross-referencing each other.
Monitoring: Prometheus scraping, alerts on down/peer drop/UDP flood/SSH failures.

Wrapping Up

Bootnode security is the gap between "we have a network" and "we have a network that can't be trivially disrupted." Eclipse attacks against post-Merge Ethereum were demonstrated in research published on arXiv in January 2026. The technical foundation has existed since 2018.
None of this is exotic. Every protection here is standard Linux and networking practice applied to a blockchain-specific context. One solid day of work. The result is a bootnode that withstands DDoS, resists eclipse attempts, and survives key compromise with a clean rotation procedure.
Questions? Drop them in the comments; I'm happy to go deeper on any of these layers.

Originally published at thegoodshell.com
