<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sonia</title>
    <description>The latest articles on DEV Community by Sonia (@soniarotglam).</description>
    <link>https://dev.to/soniarotglam</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3851636%2F471b4282-3fec-47f2-a25a-b22bdc3c0897.jpeg</url>
      <title>DEV Community: Sonia</title>
      <link>https://dev.to/soniarotglam</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/soniarotglam"/>
    <language>en</language>
    <item>
      <title>Platform engineering vs DevOps: the decision most growing startups get backwards</title>
      <dc:creator>Sonia</dc:creator>
      <pubDate>Thu, 30 Apr 2026 11:30:35 +0000</pubDate>
      <link>https://dev.to/soniarotglam/platform-engineering-vs-devops-the-decision-most-growing-startups-get-backwards-4cgb</link>
      <guid>https://dev.to/soniarotglam/platform-engineering-vs-devops-the-decision-most-growing-startups-get-backwards-4cgb</guid>
      <description>&lt;p&gt;Platform engineering is not a replacement for DevOps. It's what happens when DevOps works well enough that it creates a new problem.&lt;br&gt;
Here's the sequence most teams miss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DevOps solves the wall between dev and ops.&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;Developers own deployments. Everyone automates. Software ships faster. This works well up to 30-50 engineers. Every team manages their own infrastructure. It's messy but manageable.&lt;br&gt;
Then scale kicks in. At 80-100 engineers, "everyone owns their infrastructure" means: 12 teams with 12 different CI/CD setups, 12 different Kubernetes patterns, 12 different approaches to secret management. A new engineer needs weeks to understand how deployments work. A security audit reveals inconsistency everywhere. Senior engineers spend 30% of their time answering other teams' infrastructure questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DevOps didn't fail. It created the conditions for a new problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Platform engineering solves that problem by building an Internal Developer Platform, a product whose users are your own developers. Instead of each team configuring Kubernetes from scratch, they click "Create New Service", fill a three-line form, and get a fully configured service with pipelines, monitoring, and compliance baked in.&lt;br&gt;
The distinction that matters operationally:&lt;br&gt;
DevOps: every developer owns their infrastructure&lt;br&gt;
Platform engineering: every developer consumes infrastructure through self-service&lt;br&gt;
The platform team doesn't answer tickets. They build the tooling that eliminates the tickets.&lt;/p&gt;
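&lt;p&gt;As a rough sketch of what that three-line form can translate to under the hood (names and fields here are illustrative, in the style of a Backstage-like template, not any specific product's schema):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# what the developer fills in
service:
  name: payments-api
  team: payments
  tier: standard        # selects pre-approved defaults

# what the platform generates from it: repo, CI/CD pipeline,
# Kubernetes manifests, dashboards, and compliance policies,
# all from templates the platform team maintains
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;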

&lt;p&gt;&lt;strong&gt;The signals that tell you platform engineering is necessary:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Setting up a new service takes more than a day. Your infrastructure team is answering requests rather than building. A security audit reveals inconsistent configurations across teams. Onboarding takes weeks because there are too many different setups to learn.&lt;/p&gt;

&lt;p&gt;If none of those apply, DevOps is still the right answer for your stage. Platform engineering before the pain appears is overengineering. Platform engineering after the pain appears is recovery.&lt;/p&gt;

</description>
      <category>software</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>startup</category>
    </item>
    <item>
      <title>3 on-call rotation mistakes that burn out your best engineers first</title>
      <dc:creator>Sonia</dc:creator>
      <pubDate>Wed, 29 Apr 2026 10:32:46 +0000</pubDate>
      <link>https://dev.to/soniarotglam/3-on-call-rotation-mistakes-that-burn-out-your-best-engineers-first-4da6</link>
      <guid>https://dev.to/soniarotglam/3-on-call-rotation-mistakes-that-burn-out-your-best-engineers-first-4da6</guid>
      <description>&lt;p&gt;The engineers who leave over on-call are rarely the ones who complain about it. They're the ones who quietly absorb everything, resolve incidents fast, never escalate, and one day accept an offer somewhere else. By the time you notice the pattern, you've already lost the person the rotation was grinding down.&lt;br&gt;
Three mistakes that create that outcome.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Measuring shifts per engineer instead of load per engineer.
Equal shifts are not equal load. A week with two P1 incidents resolved in 20 minutes each is not the same as a week with twelve alerts that each require 45 minutes of investigation at 2am. If you track only who was on-call and not what that shift actually cost, you will consistently underestimate the burden on your senior engineers, who resolve things faster but get paged more often because they're trusted to handle anything.
Track actionable pages per shift per engineer. If one person consistently receives 3x the load of others, the rotation is broken regardless of how the calendar looks. The fix is alert hygiene first (delete alerts nobody acts on for 30 consecutive days), then rebalance the schedule based on load data, not headcount fairness. A minimal counting sketch follows after this list.&lt;/li&gt;
&lt;li&gt;Putting engineers on independent on-call before shadow shifts.
The correct progression before anyone carries the pager alone: observer phase (receive all the same pages, take no action, watch how the primary responds), then reverse shadow (lead the response with an experienced engineer watching), then independent. Skipping this costs you higher MTTR on every incident that engineer handles alone, plus an experience that makes on-call feel dangerous rather than manageable.
Four to six weeks of partial senior engineer time upfront costs significantly less than the first major incident where an unprepared engineer makes it worse.&lt;/li&gt;
&lt;li&gt;Treating on-call as part of the job with no additional recognition.
An engineer paged three times outside business hours in a single week and expected to deliver full sprint capacity the following week is being asked to absorb a cost that isn't being acknowledged. This doesn't require complex compensation structures. Time in lieu for overnight pages, reduced sprint commitment after heavy on-call weeks, or explicit acknowledgment in performance reviews are all sufficient. The failure mode is pretending the cost doesn't exist.
If Opsgenie is still in your stack: end-of-support is April 5, 2027. If your runbooks and escalation policies live inside it, export everything now. The format doesn't migrate cleanly into alternatives.&lt;/li&gt;
&lt;/ol&gt;
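&lt;p&gt;A minimal sketch of the load measurement from point 1, assuming you can export incidents from your paging tool as a JSON array with a responder field and an actionable flag (the field names here are hypothetical; adapt them to your tool's export):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# actionable pages per engineer over the export period
jq 'map(select(.actionable))
    | group_by(.responder)
    | map({responder: .[0].responder, pages: length})' incidents.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;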

</description>
      <category>sre</category>
      <category>devops</category>
      <category>infrastructure</category>
      <category>career</category>
    </item>
    <item>
      <title>4 Cosmos validator mistakes that get you slashed at 3am</title>
      <dc:creator>Sonia</dc:creator>
      <pubDate>Wed, 22 Apr 2026 10:46:48 +0000</pubDate>
      <link>https://dev.to/soniarotglam/4-cosmos-validator-mistakes-that-get-you-slashed-at-3am-31lf</link>
      <guid>https://dev.to/soniarotglam/4-cosmos-validator-mistakes-that-get-you-slashed-at-3am-31lf</guid>
      <description>&lt;p&gt;Cosmos validator slashing is almost entirely preventable. The operators who get slashed aren't usually victims of sophisticated attacks — they're running without one or more of the protection layers that professional validators treat as non-negotiable. Here are the four mistakes that show up most often.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Confusing double-sign with downtime: they are not the same thing.
Most validators know about slashing in the abstract. Fewer understand that the two slashing conditions have completely different consequences:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Downtime: Miss more than 500 of the last 10,000 blocks → 0.01% slash, 10-minute jail. You can unjail, rejoin the active set, and recover. Delegators will notice, but it's survivable.&lt;/p&gt;

&lt;p&gt;Double-signing: Sign two conflicting blocks at the same height → 5% slash, permanent jail. You cannot unjail after a double-sign. Your delegators lose 5% of their stake and you lose your validator permanently.&lt;/p&gt;

&lt;p&gt;The reason this distinction matters operationally: double-signing almost never happens from attacks. It happens when an operator runs a backup validator node without proper safeguards and both nodes come online simultaneously. The "I'll just spin up a second node as a failover" approach is exactly how you trigger a permanent 5% slash.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Using a backup node instead of TMKMS or Horcrux.
The correct answer to "what if my validator goes down?" is not a hot standby. It's key management.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;TMKMS (Tendermint Key Management System) extracts the signing key from your validator node into a separate process. It tracks which blocks have been signed and refuses to sign conflicting ones: double-sign protection at the signing layer, not the infrastructure layer. If someone compromises your validator host, they don't get the key.&lt;/p&gt;

&lt;p&gt;Horcrux goes further: it splits your private key into shares using multi-party computation. You configure a threshold, say 2-of-3, so no single server holds the complete key. An attacker needs to compromise multiple servers simultaneously. And if one Horcrux node goes offline, the others still have quorum to sign, so you get high availability without the double-sign risk of running a hot standby.&lt;/p&gt;

&lt;p&gt;The setup difference: TMKMS is a single process that protects the key. Horcrux is a distributed cluster that eliminates the single point of failure entirely. For validators with significant stake, Horcrux is the standard.&lt;/p&gt;
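&lt;p&gt;For orientation, a minimal TMKMS bootstrap looks roughly like this (softsign backend shown for simplicity; paths are illustrative, and you should verify the subcommands against the tmkms version you build):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo install tmkms --features=softsign
tmkms init /etc/tmkms                 # writes a skeleton tmkms.toml
tmkms softsign import ~/.gaia/config/priv_validator_key.json \
  /etc/tmkms/secrets/cosmoshub.key    # then remove the key from the validator host
tmkms start -c /etc/tmkms/tmkms.toml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;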

&lt;ol start="3"&gt;
&lt;li&gt;Monitoring at the wrong threshold.
If your alert fires when you're jailed, it's too late.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The Cosmos Hub jails you at 500 missed blocks out of 10,000. Most people set their alert at 500. By the time the alert fires, you're already jailed and the 0.01% slash has happened.&lt;/p&gt;

&lt;p&gt;The right approach is two alerts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ValidatorMissedBlocks&lt;/span&gt;

  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;increase(cosmos_validator_missed_blocks_total[10m]) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;

  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;

  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;

&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ValidatorJailRisk&lt;/span&gt;

  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cosmos_validator_missed_blocks_total &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;400&lt;/span&gt;

  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1m&lt;/span&gt;

  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The warning gives you early signal. The critical fires at 400, which is 80% of the jail threshold, while you still have time to intervene. The critical alert should go to PagerDuty, not just Slack. If it pages at 3am and nobody wakes up, you're jailed before anyone sees the message.&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Not using Cosmovisor for chain upgrades.
Chain upgrades cause a disproportionate share of slashing events. The validator misses the upgrade block, falls behind, and gets jailed for downtime. Or the operator runs the old binary past the upgrade height and ends up on the wrong fork.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Cosmovisor solves this. It watches for upgrade governance proposals, downloads the new binary, and swaps it automatically at the correct block height, no manual intervention required.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DAEMON_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gaiad

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DAEMON_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;/.gaia

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;DAEMON_ALLOW_DOWNLOAD_BINARIES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true

export &lt;/span&gt;&lt;span class="nv"&gt;DAEMON_RESTART_AFTER_UPGRADE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true

&lt;/span&gt;cosmovisor run start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The alternative is manually monitoring governance, tracking upgrade heights, and being online at the exact moment the upgrade executes. In practice this means either a lot of alerting overhead or missing upgrades when the timing is inconvenient. Cosmovisor eliminates the category of risk entirely.&lt;/p&gt;

&lt;p&gt;The layer most people skip: runbooks.&lt;br&gt;
All the monitoring in the world doesn't help if the person who gets paged at 3am doesn't know what to do. The minimum runbook set for a Cosmos validator covers three scenarios: jailed for downtime, disk space critical, and sentry node offline. At 3am you don't want to be googling the unjail command or figuring out which log to check first.&lt;/p&gt;
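&lt;p&gt;For example, the jailed-for-downtime runbook can be as short as this (a sketch for Cosmos Hub; key name, fees, and the exact status output format are illustrative and vary by chain and binary version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# 1. fix the underlying issue, then confirm the node is synced again
gaiad status 2&amp;gt;&amp;amp;1 | jq .sync_info   # catching_up must be false
# 2. after the 10-minute jail period, unjail
gaiad tx slashing unjail --from validator-key \
  --chain-id cosmoshub-4 --gas auto --fees 5000uatom
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;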

&lt;p&gt;The full guide, including the complete TMKMS and Horcrux configurations, sentry node setup, and all seven protection layers, is at thegoodshell.com.&lt;/p&gt;

&lt;p&gt;Happy to answer questions in the comments if you are working through any of these.&lt;/p&gt;

</description>
      <category>cosmos</category>
      <category>blockchain</category>
      <category>infrastructure</category>
      <category>devops</category>
    </item>
    <item>
      <title>SRE vs DevOps: the sequencing mistake that burns most startups.</title>
      <dc:creator>Sonia</dc:creator>
      <pubDate>Mon, 20 Apr 2026 14:45:51 +0000</pubDate>
      <link>https://dev.to/soniarotglam/sre-vs-devops-the-sequencing-mistake-that-burns-most-startups-3i4l</link>
      <guid>https://dev.to/soniarotglam/sre-vs-devops-the-sequencing-mistake-that-burns-most-startups-3i4l</guid>
      <description>&lt;p&gt;Most startups approach the SRE vs DevOps question wrong. They ask "which is better?" when the real question is "which do I need right now and in what order?"&lt;/p&gt;

&lt;p&gt;After seeing this play out across a lot of engineering teams, the mistake is almost always the same: hiring the wrong role at the wrong stage. Here's what actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  The one sentence that cuts through the noise.
&lt;/h2&gt;

&lt;p&gt;A DevOps engineer makes it easier to ship software. An SRE makes sure that software stays running once it's shipped.&lt;/p&gt;

&lt;p&gt;That's it. Every other difference (tooling, seniority, day-to-day work) follows from this. If your bottleneck is shipping, you have a DevOps problem. If your bottleneck is staying up, you have an SRE problem. The mistake is treating them as interchangeable or assuming you need both simultaneously from the start.&lt;/p&gt;

&lt;h2&gt;
  
  
  The sequencing trap most startups walk into.
&lt;/h2&gt;

&lt;p&gt;This is the one that costs real money: hiring an SRE before a DevOps foundation exists.&lt;/p&gt;

&lt;p&gt;An SRE without a functioning CI/CD pipeline is like hiring a Formula 1 engineer to fix a car that doesn't have wheels yet. The skills don't transfer down. An SRE wants to define SLOs, build error budgets, and design incident response processes. None of that is useful when your deployments still involve someone SSH-ing into a server and running a script manually.&lt;/p&gt;

&lt;p&gt;The correct sequencing is almost always:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;DevOps engineer to build the foundation: pipeline, IaC, basic monitoring.&lt;/li&gt;
&lt;li&gt;SRE practices once you have production traffic and the foundation is stable.&lt;/li&gt;
&lt;li&gt;Dedicated SRE hire when incident volume justifies it.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you skip step one, you'll waste step two.&lt;/p&gt;

&lt;h2&gt;
  
  
  The specific signals that tell you which one you need.
&lt;/h2&gt;

&lt;p&gt;"We have reliability problems" isn't specific enough. These are the actual triggers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need a DevOps engineer when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployments involve manual steps or specific people who need to be online.&lt;/li&gt;
&lt;li&gt;Onboarding a new engineer takes more than a day of environment setup.&lt;/li&gt;
&lt;li&gt;Your cloud costs are growing without obvious cause (IaC discipline prevents sprawl).&lt;/li&gt;
&lt;li&gt;Your CI/CD either doesn't exist or isn't trusted by the team.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You need an SRE when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your MTTR (mean time to recovery) is consistently above two hours.&lt;/li&gt;
&lt;li&gt;You have users but no defined answer to "what's our acceptable downtime per month?".&lt;/li&gt;
&lt;li&gt;Your monitoring produces alerts but no context; engineers get paged and their first action is "let me figure out where to look".&lt;/li&gt;
&lt;li&gt;You're running validator nodes, RPC endpoints, or other infrastructure where availability is contractual or financial.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point is worth calling out. For Web3 infrastructure (validators, nodes, RPC endpoints), the tolerance for downtime is near zero and the consequences of an incident are immediate and financial. SRE thinking is not optional there; it's the baseline.&lt;/p&gt;

&lt;h2&gt;
  
  
  What SREs actually bring that DevOps engineers don't.
&lt;/h2&gt;

&lt;p&gt;The biggest conceptual gap between the roles is the error budget. An SRE defines an SLO (service level objective), say 99.9% availability, and then tracks how much of that budget has been consumed; a 99.9% SLO over a 30-day month leaves roughly 43 minutes of allowable downtime. When the budget is burned, they have the authority to stop feature shipping until reliability is restored.&lt;/p&gt;

&lt;p&gt;This is not a culture DevOps engineers typically build. A DevOps engineer optimises the delivery pipeline; they're not usually responsible for making the reliability vs. velocity tradeoff explicit. An SRE makes that tradeoff quantitative and enforced.&lt;/p&gt;
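&lt;p&gt;In Prometheus terms, that enforcement often starts as a burn-rate alert. A minimal sketch, assuming a recording rule named &lt;code&gt;slo:availability:ratio_rate1h&lt;/code&gt; already exists (the rule name and the single-window threshold are assumptions, not a standard):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- alert: ErrorBudgetBurningFast
  # with a 99.9% SLO, an error rate 14.4x the budget exhausts it in ~2 days
  expr: (1 - slo:availability:ratio_rate1h) &amp;gt; 14.4 * 0.001
  for: 5m
  labels:
    severity: critical
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;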

&lt;p&gt;The practical consequence: a great SRE will tell you your product's reliability strategy is wrong. A great DevOps engineer will make your current strategy execute more smoothly. Both are valuable, but they're solving different problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  When one person can do both.
&lt;/h2&gt;

&lt;p&gt;At early stage, yes, and it's often the most efficient path. A senior engineer with both DevOps and SRE skills (sometimes called a Platform Engineer) can own the full stack: pipeline, monitoring, first SLOs, on-call rotation.&lt;/p&gt;

&lt;p&gt;This person is expensive and not easy to find. But for a Series A startup with one infrastructure hire, this is the profile that gives you the most coverage without over-hiring into specialisation you don't need yet.&lt;/p&gt;

&lt;p&gt;The roles diverge at scale. Platform teams own the tooling. SRE teams own reliability. That's a Series B+ problem.&lt;/p&gt;

&lt;p&gt;The full breakdown, including how this applies to outstaffing and what it looks like to bring in the right skills on a project basis, is at thegoodshell.com.&lt;/p&gt;

&lt;p&gt;Happy to answer questions in the comments if you are working through any of these.&lt;/p&gt;

</description>
      <category>sre</category>
      <category>devops</category>
      <category>infrastructure</category>
      <category>career</category>
    </item>
    <item>
      <title>5 GitHub Actions mistakes that will slow down (or break) your CI/CD pipeline.</title>
      <dc:creator>Sonia</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:33:43 +0000</pubDate>
      <link>https://dev.to/soniarotglam/5-github-actions-mistakes-that-will-slow-down-or-break-your-cicd-pipeline-3o4p</link>
      <guid>https://dev.to/soniarotglam/5-github-actions-mistakes-that-will-slow-down-or-break-your-cicd-pipeline-3o4p</guid>
      <description>&lt;p&gt;Most GitHub Actions tutorials get you to a green checkmark. Very few of them help you understand why your pipeline takes 8 minutes when it should take 2, or why your production deploy triggered from a feature branch PR at 11pm on a Friday.&lt;/p&gt;

&lt;p&gt;After working with a lot of engineering teams setting up CI/CD from scratch, these are the patterns that come up again and again.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. You're not caching dependencies and it's costing you minutes per run.
&lt;/h2&gt;

&lt;p&gt;The single fastest win in any GitHub Actions pipeline is dependency caching. Most people skip it because the pipeline "works." It does work. It's just running &lt;code&gt;npm install&lt;/code&gt; or &lt;code&gt;pip install&lt;/code&gt; from scratch on every single run.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Cache node modules&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/cache@v4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;~/.npm&lt;/span&gt;
    &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}&lt;/span&gt;
    &lt;span class="na"&gt;restore-keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;${{ runner.os }}-node-&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;hashFiles&lt;/code&gt; key is the part that matters: the cache invalidates automatically when your lockfile changes, so you always get fresh deps when you actually update something. When it hits, you skip the install entirely. On a mid-size Node project, this typically cuts 2–4 minutes per run.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. You're pushing to Docker Hub when GHCR is sitting right there.
&lt;/h2&gt;

&lt;p&gt;GitHub Container Registry (GHCR) is built into GitHub and works with the &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; that already exists in every workflow. No extra secrets, no separate account, no rate limiting surprises.&lt;/p&gt;

&lt;p&gt;The catch that trips people up: you need to explicitly grant the &lt;code&gt;packages: write&lt;/code&gt; permission in your job definition. Without it, the push will fail with a misleading auth error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build-and-push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
      &lt;span class="na"&gt;packages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;   &lt;span class="c1"&gt;# ← this line is required, not optional&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then authenticate like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Log in to GHCR&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/login-action@v3&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;registry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ghcr.io&lt;/span&gt;
    &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ github.actor }}&lt;/span&gt;
    &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No secrets to rotate, no third-party dependency, and images are scoped to your repo automatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Your production deploy is one accidental push away from triggering.
&lt;/h2&gt;

&lt;p&gt;If your workflow deploys to production on every push to &lt;code&gt;main&lt;/code&gt;, that's fine until someone force-pushes a fix, a bot commits a version bump, or a merge goes sideways.&lt;/p&gt;

&lt;p&gt;The pattern that solves this is a separate &lt;code&gt;deploy&lt;/code&gt; job with an &lt;code&gt;if&lt;/code&gt; condition and &lt;code&gt;needs&lt;/code&gt; chaining:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;deploy-production&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
  &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;lint-and-test&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;build-and-push&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.ref == 'refs/heads/main' &amp;amp;&amp;amp; github.event_name == 'push'&lt;/span&gt;
  &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;environment: production&lt;/code&gt; line is the one most people miss. If you've configured environment protection rules in GitHub (Settings → Environments), this gates the deploy behind required reviewers or a manual approval. It's free on public repos and included in Team plans for private ones.&lt;/p&gt;

&lt;p&gt;This means: automated deploys from &lt;code&gt;main&lt;/code&gt;, but with a human checkpoint before anything touches production.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. You're putting everything in &lt;code&gt;secrets&lt;/code&gt; when half of it should be in &lt;code&gt;vars&lt;/code&gt;.
&lt;/h2&gt;

&lt;p&gt;GitHub has two distinct places for pipeline configuration: &lt;strong&gt;Secrets&lt;/strong&gt; (encrypted, write-only, for credentials) and &lt;strong&gt;Variables&lt;/strong&gt; (plaintext, readable in UI, for config values).&lt;/p&gt;

&lt;p&gt;Most teams put everything in secrets. That means your &lt;code&gt;APP_ENV=production&lt;/code&gt; or &lt;code&gt;LOG_LEVEL=info&lt;/code&gt; is encrypted and invisible in the GitHub UI, which makes debugging and auditing unnecessarily painful.&lt;/p&gt;

&lt;p&gt;Variables are accessed with the &lt;code&gt;vars&lt;/code&gt; context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;APP_ENV&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ vars.APP_ENV }}&lt;/span&gt;
  &lt;span class="na"&gt;LOG_LEVEL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ vars.LOG_LEVEL }}&lt;/span&gt;
  &lt;span class="na"&gt;DATABASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DATABASE_URL }}&lt;/span&gt;  &lt;span class="c1"&gt;# this one actually needs to be a secret&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Practical rule: if the value isn't a credential, a token, or a key, it belongs in &lt;code&gt;vars&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. You're pinning &lt;code&gt;actions/checkout@v4&lt;/code&gt; but running on &lt;code&gt;ubuntu-latest&lt;/code&gt;.
&lt;/h2&gt;

&lt;p&gt;This is a subtle one. Pinning action versions (e.g., &lt;code&gt;actions/checkout@v4&lt;/code&gt;) is good practice: it prevents upstream changes from breaking your pipeline without warning.&lt;/p&gt;

&lt;p&gt;But then running &lt;code&gt;runs-on: ubuntu-latest&lt;/code&gt; undoes some of that stability. &lt;code&gt;ubuntu-latest&lt;/code&gt; is an alias that GitHub updates periodically (currently &lt;code&gt;ubuntu-24.04&lt;/code&gt;, soon to rotate again), and those updates can change pre-installed tool versions, breaking pipelines that depend on system-level tools.&lt;/p&gt;

&lt;p&gt;If stability matters more than getting the latest runner features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-22.04&lt;/span&gt;   &lt;span class="c1"&gt;# pinned, not latest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll need to update it manually when the version reaches end-of-life, but you control when that happens, not GitHub's release schedule.&lt;/p&gt;




&lt;p&gt;These are the patterns that separate a "it works" pipeline from one that's actually reliable in production. The full step-by-step guide covering the complete pipeline structure, including Kubernetes deploy jobs, multi-environment promotion workflows, and secrets management at scale, is at &lt;a href="https://thegoodshell.com/github-actions-ci-cd-pipeline-tutorial/" rel="noopener noreferrer"&gt;thegoodshell.com&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Happy to answer questions in the comments if you are working through any of these.&lt;/p&gt;

</description>
      <category>githubactions</category>
      <category>cicd</category>
      <category>devops</category>
      <category>docker</category>
    </item>
    <item>
      <title>Beyond Meta Tags: The SRE’s Guide to Ranking in 2026</title>
      <dc:creator>Sonia</dc:creator>
      <pubDate>Tue, 14 Apr 2026 09:08:54 +0000</pubDate>
      <link>https://dev.to/soniarotglam/beyond-meta-tags-the-sres-guide-to-ranking-in-2026-3771</link>
      <guid>https://dev.to/soniarotglam/beyond-meta-tags-the-sres-guide-to-ranking-in-2026-3771</guid>
      <description>&lt;p&gt;We have been told for years that "Content is King." But in the high-stakes world of 2026, if your infrastructure is sluggish, your king is invisible.&lt;/p&gt;

&lt;p&gt;Working at The Good Shell, I’ve spent the last few months analyzing a recurring pattern among high-growth SaaS and Web3 startups: they have world-class frontend talent and aggressive SEO targets, yet their organic growth is stagnant. After auditing several stacks, the diagnosis is almost always the same. It’s not the keywords. It's the "Technical Debt" living in the infrastructure.&lt;/p&gt;

&lt;p&gt;If you are a developer or an SRE, this is why your infrastructure is the most powerful SEO tool you have.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The Death of the "Static" SEO Mindset
&lt;/h2&gt;

&lt;p&gt;SEO used to be about what was on the page. Now, it’s about how that page is delivered. Google’s crawlers now operate with a strictly optimized "Crawl Budget."&lt;/p&gt;

&lt;p&gt;If your server takes 800ms to respond because your K8s ingress is misconfigured or your database queries are unindexed, Googlebot will simply leave. It's not that your content isn't good; it's that Google cannot afford the computational cost to wait for your server.&lt;br&gt;
The takeaway: a slow TTFB (Time to First Byte) is an immediate ranking penalty.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The Hydration Trap in Modern Frameworks
&lt;/h2&gt;

&lt;p&gt;We all love Next.js, Remix, and Nuxt. But "Hydration" is often where SEO goes to die.&lt;/p&gt;

&lt;p&gt;When your infrastructure isn't tuned for Streaming SSR (Server-Side Rendering), the browser spends too much time executing JavaScript before the page becomes "Stable." This tanks your CLS (Cumulative Layout Shift) and LCP (Largest Contentful Paint).&lt;/p&gt;

&lt;p&gt;At The Good Shell, we recently helped a client move logic from the heavy main server to the Edge. By utilizing Edge Middleware to handle geo-location and A/B testing instead of doing it at the origin, we dropped the LCP by 1.2 seconds. That change alone moved them from the second page of Google to the top 3 spots for their main keywords.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Scaling Infrastructure vs. Search Stability
&lt;/h2&gt;

&lt;p&gt;One thing people rarely discuss is how infrastructure instability affects indexation.&lt;/p&gt;

&lt;p&gt;Imagine Googlebot crawls your site during a deployment. If your CI/CD pipeline doesn't handle Zero-Downtime Deployments correctly, or if your health checks are too slow to pull a failing pod out of the rotation, the crawler hits a 5xx error.&lt;/p&gt;

&lt;p&gt;To Google, a 5xx error isn't just a temporary glitch; it's a signal of unreliability. If it happens repeatedly, your crawl frequency drops.&lt;/p&gt;
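&lt;p&gt;The Kubernetes side of preventing that is small but easy to get wrong. A minimal sketch (the probe path and port are placeholders for your app's health endpoint):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Deployment fragment: never route traffic to a pod that isn't ready
spec:
  strategy:
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  template:
    spec:
      containers:
        - name: web
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            periodSeconds: 2
            failureThreshold: 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;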

&lt;p&gt;Pro-tip: Use tools like Prometheus and Grafana not just to monitor "Uptime," but to monitor "Crawl Health." If you see an increase in 4xx/5xx errors coinciding with your deployment windows, your SEO is bleeding.&lt;/p&gt;
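&lt;p&gt;A minimal version of that "Crawl Health" alert, assuming an ingress-nginx exporter with a &lt;code&gt;status&lt;/code&gt; label (the metric name and threshold are assumptions; adapt them to your stack):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;- alert: CrawlHealth5xxSpike
  expr: sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) &amp;gt; 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Sustained 5xx responses; crawlers may be hitting errors"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;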

&lt;h2&gt;
  
  
  4. The FinOps of SEO: Efficiency is a Feature
&lt;/h2&gt;

&lt;p&gt;There is a direct correlation between resource efficiency and performance. An over-provisioned, messy Kubernetes cluster is often a slow one.&lt;/p&gt;

&lt;p&gt;When we talk about FinOps (Cloud Cost Optimization), we aren't just saving money. We are removing the overhead that adds latency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Over-instrumentation:&lt;/strong&gt; too many sidecars in your service mesh can add micro-latencies that aggregate.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Database contention:&lt;/strong&gt; slow DB responses kill your TTFB.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By cleaning up the architecture, you aren't just lowering the AWS bill; you are giving Googlebot a "green light" to crawl more of your site, faster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: The Bridge
&lt;/h2&gt;

&lt;p&gt;Technical SEO in 2026 is no longer about "tricking" a search engine. It’s about building a bridge between Marketing and SRE.&lt;/p&gt;

&lt;p&gt;If you want to stay competitive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Move logic to the Edge whenever possible.&lt;/li&gt;
&lt;li&gt;Audit your TTFB with the same intensity you audit your code.&lt;/li&gt;
&lt;li&gt;Bring SREs into the SEO conversation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Infrastructure isn't just a cost center; it's the foundation of your growth strategy. If the foundation is shaky, the skyscraper will never reach the clouds.&lt;/p&gt;

&lt;p&gt;I’m curious—how many of you have seen a direct correlation between infrastructure upgrades and organic traffic? Let’s discuss in the comments.&lt;/p&gt;

</description>
      <category>seo</category>
      <category>webdev</category>
      <category>performance</category>
      <category>sre</category>
    </item>
    <item>
      <title>Four things that will get your Cosmos validator slashed before you earn a single block reward</title>
      <dc:creator>Sonia</dc:creator>
      <pubDate>Tue, 07 Apr 2026 16:04:00 +0000</pubDate>
      <link>https://dev.to/soniarotglam/four-things-that-will-get-your-cosmos-validator-slashed-before-you-earn-a-single-block-reward-43ol</link>
      <guid>https://dev.to/soniarotglam/four-things-that-will-get-your-cosmos-validator-slashed-before-you-earn-a-single-block-reward-43ol</guid>
      <description>&lt;p&gt;The most dangerous moment in a Cosmos validator setup is not the on-chain registration. It is the ten minutes before it, when your &lt;code&gt;priv_validator_key.json&lt;/code&gt; is sitting unprotected on the validator host and you are about to run create-validator for the first time.&lt;br&gt;
Most guides walk you through the steps. Fewer tell you the specific things that will get you jailed or slashed if you skip them. These are four of them, drawn from running validators on Cosmos Hub mainnet.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. NVMe is not optional, it is the difference between signing blocks and missing them
&lt;/h2&gt;

&lt;p&gt;Every guide lists "4TB SSD" as a hardware requirement. What most of them do not emphasize is that SATA SSDs and standard HDDs will cause I/O bottlenecks under load that manifest directly as missed blocks.&lt;br&gt;
The chain data on Cosmos Hub has grown significantly. Under normal operation, the node is continuously reading and writing to disk. During governance-triggered upgrades, that load spikes. If your disk cannot keep up, the node falls behind on block processing and starts missing signatures.&lt;br&gt;
NVMe specifically matters because the throughput difference between NVMe and SATA SSD is not marginal. It is the difference between a node that stays in sync under pressure and one that starts accumulating missed blocks at exactly the moment you can least afford it.&lt;br&gt;
RAM is the second one people underestimate. You need 64GB. The 32GB setups work fine in normal operation. They fail during upgrades, when memory spikes well above the normal operating baseline. Running out of memory at upgrade height is a jailing event.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Never set &lt;code&gt;DAEMON_ALLOW_DOWNLOAD_BINARIES=true&lt;/code&gt; in Cosmovisor
&lt;/h2&gt;

&lt;p&gt;This feels counterintuitive. Cosmovisor's auto-download feature sounds useful: you stage the upgrade in governance, and Cosmovisor downloads and swaps the binary automatically at the right block height.&lt;br&gt;
The problem is what happens when the download fails. If the binary cannot be fetched at upgrade height, the node halts immediately. You are now racing to manually place the binary before the jailing threshold kicks in. On Cosmos Hub, that window is approximately 500 blocks, roughly 50 minutes at ~6-second block times.&lt;br&gt;
The safer pattern is to always pre-place upgrade binaries manually in the Cosmovisor upgrade directory before the governance proposal passes. You monitor the proposal, you compile and verify the binary, you put it in place. Cosmovisor finds it already there and does the swap cleanly.&lt;br&gt;
&lt;code&gt;DAEMON_ALLOW_DOWNLOAD_BINARIES=false&lt;/code&gt; forces you into this pattern. It removes the failure mode where an auto-download kills your uptime at exactly the worst moment.&lt;/p&gt;
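&lt;p&gt;The pre-placement itself is a couple of commands (the upgrade name &lt;code&gt;v21&lt;/code&gt; is illustrative; it must exactly match the name in the governance proposal):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Cosmovisor looks in &amp;lt;DAEMON_HOME&amp;gt;/cosmovisor/upgrades/&amp;lt;name&amp;gt;/bin
mkdir -p "$HOME/.gaia/cosmovisor/upgrades/v21/bin"
cp ./gaiad-v21 "$HOME/.gaia/cosmovisor/upgrades/v21/bin/gaiad"
"$HOME/.gaia/cosmovisor/upgrades/v21/bin/gaiad" version  # verify before upgrade height
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;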

&lt;h2&gt;
  
  
  3. The migration double-sign window is where most slashing events happen
&lt;/h2&gt;

&lt;p&gt;Double-sign slashing is permanent. It does not unjail. The tombstone is final.&lt;br&gt;
The scenario that causes it most often is not a configuration mistake during initial setup. It is a validator migration: moving from one host to another. The sequence that causes it:&lt;br&gt;
Old node is stopped. New node is started. Old node process was not actually stopped, or was restarted by a systemd restart policy, or a snapshot was used and the old node resumed from a state that did not reflect the stop.&lt;br&gt;
Both nodes are now signing with the same key. Double-sign event. Tombstone.&lt;br&gt;
The protection is simple but must be deliberate. When migrating: stop the old node, wait for a minimum of 10 confirmed blocks with no signing activity from that key, then start the new node. Never start the new node and then stop the old one. Never assume a stop command worked without verifying it.&lt;br&gt;
Setting &lt;code&gt;double_sign_check_height&lt;/code&gt; to a non-zero value in config.toml (10 to 20 blocks is standard) adds a second layer. The node will check recent block history before signing and refuse to sign if it detects a potential double-sign situation.&lt;/p&gt;
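&lt;p&gt;The config change is one line (a minimal sketch of the relevant config.toml setting):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;# config.toml on the validator (top level): check the last N blocks
# for this key's signatures before starting to sign
double_sign_check_height = 10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;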

&lt;h2&gt;
  
  
  4. The sentry architecture is what keeps your validator IP off the public internet
&lt;/h2&gt;

&lt;p&gt;A validator without sentry nodes has its IP address visible in the P2P network. That is a DDoS target. Taking your validator offline long enough to miss 5% of blocks in a sliding window triggers jailing on Cosmos Hub.&lt;br&gt;
The sentry pattern is straightforward: two or more public-facing full nodes handle all external P2P connections. The validator node only connects to the sentries, never to the broader network. Its IP is never gossiped to peers.&lt;br&gt;
On the validator node, this means &lt;code&gt;pex = false&lt;/code&gt; and &lt;code&gt;persistent_peers&lt;/code&gt; pointing only to the sentry node IDs. On the sentry nodes, the validator node ID is listed in &lt;code&gt;private_peer_ids&lt;/code&gt; so its address is never shared with the network.&lt;br&gt;
Run sentries in at least two different geographic regions and on different providers. A DDoS that takes down one sentry is neutralised if the second is on a separate network.&lt;/p&gt;
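&lt;p&gt;The corresponding config.toml entries, with node IDs and IPs as placeholders:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;# validator node config.toml ([p2p] section): talk only to the sentries
pex = false
persistent_peers = "&amp;lt;sentry1-id&amp;gt;@10.0.1.10:26656,&amp;lt;sentry2-id&amp;gt;@10.0.2.10:26656"

# sentry node config.toml ([p2p] section): never gossip the validator's address
private_peer_ids = "&amp;lt;validator-node-id&amp;gt;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;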

&lt;p&gt;These four are the ones that cause the most production incidents on Cosmos validators: the hardware under-specification, the auto-download failure mode, the migration double-sign window, and the missing sentry layer. The rest of the setup (Go installation, gaiad build, state sync, TMKMS configuration, on-chain registration) is more mechanical.&lt;br&gt;
If you want the full setup with all the configuration files and commands from start to production, I wrote a detailed guide covering the complete process:&lt;br&gt;
&lt;a href="https://thegoodshell.com/cosmos-validator-setup/" rel="noopener noreferrer"&gt;Cosmos Validator Setup: The Ultimate Step-by-Step Guide for 2026&lt;/a&gt;&lt;br&gt;
Happy to answer questions in the comments if you are working through any of these.&lt;/p&gt;

</description>
      <category>blockchain</category>
      <category>cosmos</category>
      <category>devops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Bootnode Security: 6 Essential Hardening Layers to Protect Your Web3 Network</title>
      <dc:creator>Sonia</dc:creator>
      <pubDate>Tue, 31 Mar 2026 08:38:05 +0000</pubDate>
      <link>https://dev.to/soniarotglam/bootnode-security-6-essential-hardening-layers-to-protect-your-web3-network-2jj4</link>
      <guid>https://dev.to/soniarotglam/bootnode-security-6-essential-hardening-layers-to-protect-your-web3-network-2jj4</guid>
<description>&lt;p&gt;If you run a blockchain network (private, permissioned, or public), you have at least one bootnode. Almost nobody has hardened it properly.&lt;br&gt;
This is understandable. Bootnodes are infrastructure plumbing. They don't hold keys, they don't sign transactions. The assumption is that if a bootnode goes down, the network just loses peer discovery for a while. That assumption is wrong.&lt;br&gt;
Here's what a compromised bootnode actually enables: eclipse attacks. An attacker who controls your bootnode can feed newly joining nodes a list of attacker-controlled peers. Those nodes then sync from attacker-controlled infrastructure. For a DeFi protocol or validator, this creates conditions for double-spend attacks, transaction censorship, and consensus manipulation.&lt;br&gt;
A January 2026 paper on arXiv demonstrated the first practical end-to-end eclipse attack against post-Merge Ethereum execution layer nodes. This is not theoretical anymore.&lt;br&gt;
This guide covers 6 hardening layers that every production bootnode needs.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Real Threat Model
&lt;/h2&gt;

&lt;p&gt;Before writing a single firewall rule, understand what you're actually defending against:&lt;br&gt;
&lt;strong&gt;DDoS against the discovery port:&lt;/strong&gt; bootnodes run UDP on port 30303 by default. UDP is stateless and easy to flood. A sustained attack takes down peer discovery for your entire network.&lt;br&gt;
&lt;strong&gt;Enode key compromise:&lt;/strong&gt; the enode private key is your bootnode's identity. If an attacker steals it, they can impersonate your bootnode indefinitely with a node your network trusts.&lt;br&gt;
&lt;strong&gt;Eclipse attacks via discovery poisoning:&lt;/strong&gt; attackers inject malicious nodes into a target's peer database by exploiting passive discovery behavior. A bootnode without rate limiting amplifies this attack.&lt;br&gt;
&lt;strong&gt;Sybil attacks against the discovery table:&lt;/strong&gt; bootnodes maintain a Kademlia-style table with 17 K-buckets, each holding up to 16 nodes. A Sybil attacker floods the table with controlled node IDs, crowding out legitimate peers. New nodes then get routed exclusively to attacker-controlled infrastructure.&lt;/p&gt;
&lt;h2&gt;
  
  
  Layer 1 - Host Hardening
&lt;/h2&gt;

&lt;p&gt;Run nothing else on the bootnode host. Minimal attack surface is not optional.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Disable unnecessary services&lt;/span&gt;
systemctl disable &lt;span class="nt"&gt;--now&lt;/span&gt; snapd cups avahi-daemon bluetooth

&lt;span class="c"&gt;# SSH hardening /etc/ssh/sshd_config&lt;/span&gt;
Port 22222
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication &lt;span class="nb"&gt;yes
&lt;/span&gt;AllowUsers bootnode-admin
MaxAuthTries 3
X11Forwarding no
AllowTcpForwarding no
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Store the enode key on an encrypted volume:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cryptsetup luksFormat /dev/sdb
cryptsetup luksOpen /dev/sdb bootnode-keys
mkfs.ext4 /dev/mapper/bootnode-keys
mount /dev/mapper/bootnode-keys /mnt/bootnode-keys
&lt;span class="nb"&gt;chmod &lt;/span&gt;700 /mnt/bootnode-keys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Layer 2 - Network Hardening
&lt;/h2&gt;

&lt;p&gt;This is where most bootnode security implementations fall apart. The default allows connections from any IP on any port. Fine for getting started. Not acceptable in production.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ufw default deny incoming
ufw default allow outgoing
ufw allow from &amp;lt;MANAGEMENT_IP&amp;gt; to any port 22222 proto tcp
ufw allow 30303/udp
ufw allow 30303/tcp
ufw &lt;span class="nb"&gt;enable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rate limit UDP with iptables.&lt;/strong&gt; UFW alone doesn't rate-limit UDP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;iptables &lt;span class="nt"&gt;-A&lt;/span&gt; INPUT &lt;span class="nt"&gt;-p&lt;/span&gt; udp &lt;span class="nt"&gt;--dport&lt;/span&gt; 30303 &lt;span class="nt"&gt;-m&lt;/span&gt; hashlimit &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--hashlimit-name&lt;/span&gt; udp-discovery &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--hashlimit-above&lt;/span&gt; 100/second &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--hashlimit-burst&lt;/span&gt; 200 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--hashlimit-mode&lt;/span&gt; srcip &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-j&lt;/span&gt; DROP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For private/permissioned networks: restrict discovery to known IP ranges. There is no reason your bootnode should accept requests from arbitrary internet IPs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ufw allow from &amp;lt;NODE_IP_RANGE&amp;gt;/24 to any port 30303
ufw deny 30303
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This single change is the most impactful improvement for private networks and almost nobody does it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 3 - Enode Key Management
&lt;/h2&gt;

&lt;p&gt;Generate the key before starting the node. Never let the client auto-generate it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate and record the public key&lt;/span&gt;
bootnode &lt;span class="nt"&gt;-genkey&lt;/span&gt; /mnt/bootnode-keys/bootnode.key
bootnode &lt;span class="nt"&gt;-nodekey&lt;/span&gt; /mnt/bootnode-keys/bootnode.key &lt;span class="nt"&gt;-writeaddress&lt;/span&gt;

&lt;span class="c"&gt;# Secure permissions&lt;/span&gt;
&lt;span class="nb"&gt;chmod &lt;/span&gt;400 /mnt/bootnode-keys/bootnode.key
&lt;span class="nb"&gt;chown &lt;/span&gt;bootnode-service:bootnode-service /mnt/bootnode-keys/bootnode.key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Systemd with sandboxing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight systemd"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/bootnode.service&lt;/span&gt;
&lt;span class="k"&gt;[Service]&lt;/span&gt;
&lt;span class="nt"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;bootnode-service
&lt;span class="nt"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;/usr/local/bin/bootnode &lt;span class="se"&gt;\
&lt;/span&gt;  -nodekey /mnt/bootnode-keys/bootnode.key &lt;span class="se"&gt;\
&lt;/span&gt;  -addr :30303
&lt;span class="nt"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;always
&lt;span class="nt"&gt;NoNewPrivileges&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;true
&lt;span class="nt"&gt;PrivateTmp&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;true
&lt;span class="nt"&gt;ProtectSystem&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;strict
&lt;span class="nt"&gt;ReadWritePaths&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;/mnt/bootnode-keys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Back up the key to offline storage immediately. The offline backup must be tested, not just created.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 4 - Eclipse Attack Prevention
&lt;/h2&gt;

&lt;p&gt;Run at least 3 geographically distributed bootnodes across different cloud providers. An attacker needs to compromise all three simultaneously to control peer discovery.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Each node points to all bootnodes&lt;/span&gt;
geth &lt;span class="nt"&gt;--bootnodes&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"enode://&amp;lt;pubkey1&amp;gt;@&amp;lt;ip1&amp;gt;:30303,enode://&amp;lt;pubkey2&amp;gt;@&amp;lt;ip2&amp;gt;:30303,enode://&amp;lt;pubkey3&amp;gt;@&amp;lt;ip3&amp;gt;:30303"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each bootnode lists the others for faster discovery and resilience.&lt;br&gt;
Enable ENR/Discv5 where supported; it includes cryptographic verification that makes node impersonation significantly harder than the legacy enode scheme.&lt;/p&gt;
&lt;h2&gt;
  
  
  Layer 5 - Monitoring and Alerting
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Prometheus alerting rules&lt;/span&gt;
&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bootnode.security&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BootnodeDown&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;up{job="bootnode"} == &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BootnodePeerCountDrop&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;p2p_peers &amp;lt; &lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Low&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;peer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;possible&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;eclipse&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;DDoS"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BootnodeUDPFlood&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rate(net_p2p_ingress_bytes_total[1m]) &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;50000000&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Possible&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;DDoS&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;discovery&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;port"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Layer 6 - Disaster Recovery and Key Rotation
&lt;/h2&gt;

&lt;p&gt;If the bootnode key is compromised, you need a pre-defined rotation procedure. Test it before you need it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate new key on new instance&lt;/span&gt;
bootnode &lt;span class="nt"&gt;-genkey&lt;/span&gt; /mnt/keys/bootnode-new.key
bootnode &lt;span class="nt"&gt;-nodekey&lt;/span&gt; /mnt/keys/bootnode-new.key &lt;span class="nt"&gt;-writeaddress&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; new-enode-pubkey.txt

&lt;span class="c"&gt;# Push new enode to all network nodes via Ansible&lt;/span&gt;
&lt;span class="c"&gt;# Bring up new bootnode&lt;/span&gt;
systemctl start bootnode-new

&lt;span class="c"&gt;# After confirming healthy take down compromised node&lt;/span&gt;
systemctl stop bootnode-old
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Multi-region deployment is non-negotiable for production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Region 1 (AWS eu-west-1) elastic IP&lt;/li&gt;
&lt;li&gt;Region 2 (Hetzner Helsinki) static IP&lt;/li&gt;
&lt;li&gt;Region 3 (GCP us-east1) static IP&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running on different providers means a cloud-level outage doesn't take down your entire discovery layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Quick Checklist
&lt;/h2&gt;

&lt;p&gt;Before deploying any production bootnode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Host:&lt;/strong&gt; dedicated host, SSH on non-standard port, key-only auth, disk encryption for keys, systemd sandboxing.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Network:&lt;/strong&gt; UFW default deny, UDP rate limiting, SSH restricted to management IP, IP allowlisting for private networks.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Enode key:&lt;/strong&gt; generated pre-start, encrypted volume, 400 permissions, offline backup tested, rotation runbook documented.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Architecture:&lt;/strong&gt; minimum 3 bootnodes, cross-region, cross-provider, cross-referencing each other.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Monitoring:&lt;/strong&gt; Prometheus scraping, alerts on down/peer drop/UDP flood/SSH failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Bootnode security is the gap between "we have a network" and "we have a network that can't be trivially disrupted." Eclipse attacks against post-Merge Ethereum were demonstrated in published research in January 2026. The technical foundation has existed since 2018.&lt;br&gt;
None of this is exotic. Every protection here is standard Linux and networking practice applied to a blockchain-specific context. One solid day of work. The result is a bootnode that withstands DDoS, resists eclipse attempts, and survives key compromise with a clean rotation procedure.&lt;br&gt;
Questions? Drop them in the comments; happy to go deeper on any of these layers.&lt;/p&gt;

&lt;p&gt;Originally published at &lt;a href="https://thegoodshell.com/" rel="noopener noreferrer"&gt;thegoodshell.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>web3</category>
      <category>devops</category>
      <category>blockchain</category>
      <category>security</category>
    </item>
  </channel>
</rss>
