<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Neeraja Khanapure</title>
    <description>The latest articles on DEV Community by Neeraja Khanapure (@neeraja_khanapure_4a33a5f).</description>
    <link>https://dev.to/neeraja_khanapure_4a33a5f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3594314%2Feb8f6250-03b3-4528-af8b-17a146fe27c2.png</url>
      <title>DEV Community: Neeraja Khanapure</title>
      <link>https://dev.to/neeraja_khanapure_4a33a5f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/neeraja_khanapure_4a33a5f"/>
    <language>en</language>
    <item>
      <title>Not in any textbook — learned this from a 3am page:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Tue, 28 Apr 2026 10:34:18 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/not-in-any-textbook-learned-this-from-a-3am-page-5edn</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/not-in-any-textbook-learned-this-from-a-3am-page-5edn</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-04-28)
&lt;/h1&gt;

&lt;p&gt;Not in any textbook — learned this from a 3am page:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes rollouts: why 'pods are Ready' is the wrong promotion gate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Readiness is a node-local signal. Production health is a global one. Most rollout pipelines conflate the two — and that's where incidents come from.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bad gate:                         Good gate:

Deploy ──▶ Pods Ready? ──▶ Done   Deploy ──▶ Pods Ready?
           (local signal)                    │
                                             ▼
                                    SLO window check
                                    (error rate + p95)
                                             │
                                    Pass ──▶ Promote
                                    Fail ──▶ Auto-rollback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ 100% Ready pods while P95 latency spikes — bad cache warmup, noisy neighbor, DB connection saturation.&lt;br&gt;
▸ HPA reacts slower than a fast rollout — you ship overload before autoscaling catches up.&lt;br&gt;
▸ Canary stuck green because metrics lack the right labels/slices to isolate the failing segment.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ Promote only when the canary holds your SLO slice (error rate + latency) for a fixed observation window. Otherwise: auto-rollback.&lt;/p&gt;

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ Argo Rollouts or Flagger with Prometheus gates — error rate, latency percentiles, saturation.&lt;br&gt;
▸ Alert on canary-vs-baseline deltas, not absolute thresholds. Catches regressions that pass absolute checks.&lt;/p&gt;
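
&lt;p&gt;For teams not yet on Argo Rollouts or Flagger, the gate logic itself is small enough to sketch. Below is a minimal illustration, assuming a reachable Prometheus query API; the URL, metric and label names, and thresholds are placeholders, and the analysis templates in those tools replace this entirely.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch of the promotion gate, not a drop-in controller: hold the canary for a
# fixed observation window, then query Prometheus and decide. Endpoint and
# metric/label names are placeholders for whatever your stack exposes.
import time
import requests

PROM = "http://prometheus.monitoring:9090/api/v1/query"

def instant(query):
    """Run an instant query; return the first sample as a float, or None."""
    data = requests.get(PROM, params={"query": query}, timeout=10).json()
    result = data["data"]["result"]
    return float(result[0]["value"][1]) if result else None

def canary_gate(window_minutes=15, max_error_rate=0.001, max_p95_seconds=0.5):
    time.sleep(window_minutes * 60)   # the SLO observation window
    error_rate = instant(
        'sum(rate(http_requests_total{track="canary",code=~"5.."}[15m]))'
        ' / sum(rate(http_requests_total{track="canary"}[15m]))')
    p95 = instant(
        'histogram_quantile(0.95, sum(rate('
        'http_request_duration_seconds_bucket{track="canary"}[15m])) by (le))')
    if error_rate is None or p95 is None:
        return "rollback"             # missing data fails the gate, never passes it
    if error_rate &amp;lt; max_error_rate and p95 &amp;lt; max_p95_seconds:
        return "promote"
    return "rollback"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;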

&lt;p&gt;Operational maturity isn't about tools — it's about designing for the failure you haven't seen yet.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/kubernetes-rollouts-why-pods-are-ready-is-the-wrong-promotion-gate" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/kubernetes-rollouts-why-pods-are-ready-is-the-wrong-promotion-gate&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Strong opinions on this? Good. I want to hear the pushback.&lt;/p&gt;

&lt;p&gt;#kubernetes #reliability #devops #sre&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>A Friday systems thinking thread — worth sitting with:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Fri, 24 Apr 2026 10:12:05 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/a-friday-systems-thinking-thread-worth-sitting-with-4fbl</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/a-friday-systems-thinking-thread-worth-sitting-with-4fbl</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Insight (2026-04-24)
&lt;/h1&gt;

&lt;p&gt;A Friday systems thinking thread — worth sitting with:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability debt is invisible until an incident makes it expensive&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can't debug what you can't slice. Teams add dashboards for years and still can't answer the two questions that matter most in an incident: which customers are affected, and which change caused it. The problem is almost never the tool — it's the label strategy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Observability debt accumulation:

Month 1:  Service A metrics added (no ownership labels)
Month 3:  Service B metrics added (different label schema)
Month 6:  Dashboard count: 47. Useful in incident: 3.
Month 9:  P0 incident. Can't isolate by customer/version.
          Engineer guesses. Guesses wrong. +45min MTTR.

Fix: Define label schema FIRST. Instrument second.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious part:&lt;br&gt;
→ The teams that debug incidents fastest don't have more metrics — they have metrics that answer the right questions at the right cardinality. SLI-first instrumentation design is a force multiplier. Most teams instrument first and wonder why dashboards are noisy.&lt;/p&gt;

&lt;p&gt;My rule:&lt;br&gt;
→ Define your SLIs, then design labels that let you isolate by (service, env, version, customer tier) without exploding cardinality. Instrument last.&lt;/p&gt;
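
&lt;p&gt;To make 'schema first' enforceable rather than aspirational, it helps to audit what's already being scraped. A rough sketch below, using Prometheus's series endpoint; the URL, the required label set, and the metric names are assumptions for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: audit existing series against a team label schema.
# The required labels, metric names, and Prometheus URL are illustrative.
import requests

PROM = "http://prometheus.monitoring:9090/api/v1/series"
REQUIRED = {"service", "env", "version", "customer_tier"}

def missing_labels(metric_name):
    """Return required labels absent from at least one series of the metric."""
    resp = requests.get(PROM, params={"match[]": metric_name}, timeout=10)
    missing = set()
    for labels in resp.json()["data"]:
        missing |= REQUIRED - set(labels)
    return missing

for metric in ["http_requests_total", "checkout_latency_seconds_bucket"]:
    gaps = missing_labels(metric)
    if gaps:
        print(f"{metric}: cannot slice reliably by {sorted(gaps)}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;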

&lt;p&gt;Worth reading:&lt;br&gt;
▸ Brendan Gregg's USE Method + Google's RED Method for SLI-first design&lt;br&gt;
▸ Prometheus label best practices — cardinality anti-patterns (prometheus.io/docs)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neeraja-portfolio-v1.vercel.app/insights/observability-debt-is-invisible-until-an-incident-makes-it-expensive" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/insights/observability-debt-is-invisible-until-an-incident-makes-it-expensive&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's the version of this that your org gets wrong? Drop it below.&lt;/p&gt;

&lt;p&gt;#devops #sre #observability #platformengineering&lt;/p&gt;

</description>
      <category>observability</category>
      <category>sre</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>A hard-earned rule from incident retrospectives:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Tue, 21 Apr 2026 10:05:01 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/a-hard-earned-rule-from-incident-retrospectives-1e59</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/a-hard-earned-rule-from-incident-retrospectives-1e59</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-04-21)
&lt;/h1&gt;

&lt;p&gt;A hard-earned rule from incident retrospectives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitOps drift: the silent accumulation that makes clusters unmanageable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GitOps promises Git as the source of truth. The reality: every manual &lt;code&gt;kubectl&lt;/code&gt; during an incident is a lie you told your cluster and forgot to retract.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GitOps truth gap over time:

Week 1:  Git ══════════ Cluster  (clean)
Week 4:  Git ══════╌╌╌╌ Cluster  (2 manual patches)
Week 12: Git ════╌╌╌╌╌╌╌╌╌╌╌╌╌  (drift accumulates)
                         Cluster  (unknown state)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Manual patches during incidents create cluster state Git doesn't know about — Argo/Flux will overwrite it silently.&lt;br&gt;
▸ Secrets managed outside GitOps (sealed-secrets, Vault agent) drift independently — invisible in sync status.&lt;br&gt;
▸ Multi-cluster setups multiply drift: each cluster diverges at its own pace once human intervention happens.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ Treat every manual cluster change as a 5-minute loan. Commit it back to Git before the incident closes — or it's gone.&lt;/p&gt;

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ Argo CD drift detection dashboard — surface out-of-sync resources before they become incident contributors.&lt;br&gt;
▸ Weekly diff job: live cluster state vs Git. Opens a PR for anything untracked. Makes drift visible before it's painful.&lt;/p&gt;
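
&lt;p&gt;The weekly diff job needs very little machinery. A minimal sketch, assuming rendered manifests live as plain YAML under a manifests/ directory and kubectl has a context for the cluster; Argo CD's app diff gives the same signal if you'd rather stay inside it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch of a scheduled drift check: compare rendered Git manifests to live state.
# Paths and context names are placeholders. `kubectl diff` exits 1 when it finds
# differences, which is exactly the case we want to surface.
import subprocess
import sys

def check_drift(manifest_dir="manifests/", context="prod"):
    result = subprocess.run(
        ["kubectl", "--context", context, "diff", "-R", "-f", manifest_dir],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        print("No drift: live state matches Git.")
        return False
    if result.returncode == 1:
        print("Drift detected; commit it back to Git or revert the live change:")
        print(result.stdout)
        return True
    raise RuntimeError(result.stderr)     # kubectl itself failed

if __name__ == "__main__":
    sys.exit(1 if check_drift() else 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;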

&lt;p&gt;The best platform teams I've seen measure success by how rarely product teams have to think about infrastructure.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/gitops-drift-the-silent-accumulation-that-makes-clusters-unmanageable" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/gitops-drift-the-silent-accumulation-that-makes-clusters-unmanageable&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Curious what guardrails you've built around this. Drop your pattern below.&lt;/p&gt;

&lt;p&gt;#gitops #kubernetes #devops #platformengineering&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Something every senior engineer learns the expensive way:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Sun, 19 Apr 2026 16:53:12 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/something-every-senior-engineer-learns-the-expensive-way-41cd</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/something-every-senior-engineer-learns-the-expensive-way-41cd</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-04-19)
&lt;/h1&gt;

&lt;p&gt;Something every senior engineer learns the expensive way:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terraform DAGs at scale: when the graph becomes the hazard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Terraform's dependency graph is elegant at small scale. At 500+ resources across a mono-repo, it becomes the most dangerous part of your infrastructure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SAFE (small module):              DANGEROUS (at scale):

[vpc] ──▶ [subnet] ──▶ [ec2]     [shared-net] ──▶ [team-a-infra]
                                          │         [team-b-infra]
                                          │         [team-c-infra]
                                          │         [data-layer]
                                  One change → fan-out destroy/create
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Implicit ordering assumptions survive until a refactor exposes them — usually as an unplanned destroy chain in prod.&lt;br&gt;
▸ Fan-out graphs make blast radius review near-impossible. 'What does this change affect?' has no fast answer.&lt;br&gt;
▸ &lt;code&gt;depends_on&lt;/code&gt; papers over bad module interfaces — it fixes the symptom and couples the modules permanently.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ If a module needs &lt;code&gt;depends_on&lt;/code&gt; to be safe, the module boundary is wrong. Redesign the interface — don't paper over it.&lt;/p&gt;

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ &lt;code&gt;terraform graph | dot -Tsvg &amp;gt; graph.svg&lt;/code&gt; — visualize fan-out and cycles before every major refactor.&lt;br&gt;
▸ Gate all applies with OPA/Conftest + mandatory human review on any planned destroy operations.&lt;/p&gt;
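
&lt;p&gt;The destroy gate can be a few lines in CI rather than policy prose. A sketch below that inspects the JSON plan; it assumes the pipeline has already produced plan.json, and a real setup would sit alongside the OPA/Conftest policies above rather than replace them.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: fail the pipeline when a Terraform plan contains destroy actions.
# Assumes earlier steps already ran:
#   terraform plan -out=plan.bin
#   terraform show -json plan.bin &amp;gt; plan.json
import json
import sys

def planned_destroys(plan_path="plan.json"):
    with open(plan_path) as f:
        plan = json.load(f)
    return [
        change["address"]
        for change in plan.get("resource_changes", [])
        if "delete" in change["change"]["actions"]
    ]

if __name__ == "__main__":
    doomed = planned_destroys()
    if doomed:
        print("Planned destroys; require explicit human approval:")
        for address in doomed:
            print(f"  {address}")
        sys.exit(1)
    print("No destroys planned.")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;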

&lt;p&gt;The difference between a senior engineer and a principal is knowing which guardrails to build before you need them.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/terraform-dags-at-scale-when-the-graph-becomes-the-hazard" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/terraform-dags-at-scale-when-the-graph-becomes-the-hazard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Curious what guardrails you've built around this. Drop your pattern below.&lt;/p&gt;

&lt;p&gt;#terraform #iac #devops #sre&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Not in any textbook — learned this from a 3am page:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:03:01 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/not-in-any-textbook-learned-this-from-a-3am-page-ipj</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/not-in-any-textbook-learned-this-from-a-3am-page-ipj</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-04-18)
&lt;/h1&gt;

&lt;p&gt;Not in any textbook — learned this from a 3am page:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes cost spikes: the usual suspects and how to find them fast&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloud bills spike in Kubernetes for the same few reasons every time. None of them show up in the default dashboards, and most stay invisible until the month-end invoice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost leak sources (ranked by surprise factor):

1. Unset resource requests   → scheduler packs nodes → OOM → over-provision
2. Autoscaler scale-down lag → zombie nodes after traffic spike
3. Unsampled log pipelines   → 40% of bill, 0% of dashboards
4. Idle namespaces           → dev clusters running 24/7
5. Spot interruption gaps    → fallback to on-demand, never reverted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Missing resource requests let the scheduler over-pack nodes — when pods OOM, you over-provision to compensate.&lt;br&gt;
▸ Cluster autoscaler adds nodes faster than it removes them. Spot interruptions leave zombie capacity for hours.&lt;br&gt;
▸ Logging agents (Fluentd/Filebeat) on every node with no sampling become the largest line item nobody owns.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ Every workload needs requests AND limits. Review autoscaler scale-down thresholds monthly. Sample logs at source, not at the sink.&lt;/p&gt;

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ Kubecost or OpenCost — per-namespace/team attribution. Without this, no one feels accountable for the number.&lt;br&gt;
▸ KEDA for event-driven workload scaling — eliminates idle replicas without sacrificing responsiveness.&lt;/p&gt;
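
&lt;p&gt;Leak #1 is also the easiest to surface before the invoice does. A small sketch that walks the pod list and flags containers with no requests set; it shells out to kubectl, so treat it as an audit script, not a controller.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: flag containers scheduled without CPU/memory requests.
# Relies only on `kubectl get pods -A -o json`; adjust the filtering to taste.
import json
import subprocess

def containers_without_requests():
    out = subprocess.run(
        ["kubectl", "get", "pods", "-A", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    offenders = []
    for pod in json.loads(out)["items"]:
        namespace = pod["metadata"]["namespace"]
        name = pod["metadata"]["name"]
        for container in pod["spec"]["containers"]:
            requests = container.get("resources", {}).get("requests", {})
            if "cpu" not in requests or "memory" not in requests:
                offenders.append(f"{namespace}/{name}/{container['name']}")
    return offenders

for item in containers_without_requests():
    print(item)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;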

&lt;p&gt;The best platform teams I've seen measure success by how rarely product teams have to think about infrastructure.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/kubernetes-cost-spikes-the-usual-suspects-and-how-to-find-them-fast" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/kubernetes-cost-spikes-the-usual-suspects-and-how-to-find-them-fast&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Strong opinions on this? Good. I want to hear the pushback.&lt;/p&gt;

&lt;p&gt;#kubernetes #finops #devops #platformengineering&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>This is what separates teams that scale from teams that survive:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Fri, 17 Apr 2026 10:00:17 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/this-is-what-separates-teams-that-scale-from-teams-that-survive-5af4</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/this-is-what-separates-teams-that-scale-from-teams-that-survive-5af4</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Insight (2026-04-17)
&lt;/h1&gt;

&lt;p&gt;This is what separates teams that scale from teams that survive:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependency updates are reliability work, not maintenance work&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The orgs that get hit hardest by CVEs and supply chain incidents have one thing in common: they batch dependency updates into quarterly sprints. By then, the update is a 6-version jump, the changelog is 200 lines, and the 'quick upgrade' becomes a multi-day incident waiting to happen.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Batch update model (risky):   Continuous model (safe):

Q1: Skip                      Week 1: +2 deps (auto-PR)
Q2: Skip                      Week 2: +1 dep (auto-PR)
Q3: Skip                      Week 3: +3 deps (auto-PR)
Q4: "Upgrade sprint"          ...
    6-version jumps            Each PR: small diff, fast review
    Breaking changes           Rollback: one PR revert
    3-day debugging session    MTTR if it breaks: 10 min
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious part:&lt;br&gt;
→ Automated dependency PRs (Renovate/Dependabot) with a 2-week merge SLA cost almost nothing — a 5-minute review per PR. Skipping them accumulates a compounding tax: more conflicts, larger blast radius, slower rollback. The math strongly favors continuous updates.&lt;/p&gt;

&lt;p&gt;My rule:&lt;br&gt;
→ Automate dependency PRs. Set a team policy: merge or explicitly defer within 2 weeks. Every skip is a known risk you're consciously accepting — treat it that way.&lt;/p&gt;
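
&lt;p&gt;The 2-week SLA is easier to keep when something checks it mechanically. A sketch against the GitHub REST API below; the repo name, token variable, and bot logins are placeholders for your setup.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: report dependency PRs that have outlived the merge-or-defer SLA.
# Repo, token env var, and author filter are assumptions for illustration.
import os
from datetime import datetime, timedelta, timezone

import requests

REPO = "your-org/your-service"
SLA = timedelta(days=14)
BOTS = {"renovate[bot]", "dependabot[bot]"}

headers = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}"}
pulls = requests.get(
    f"https://api.github.com/repos/{REPO}/pulls",
    params={"state": "open", "per_page": 100},
    headers=headers, timeout=10,
).json()

now = datetime.now(timezone.utc)
for pr in pulls:
    if pr["user"]["login"] not in BOTS:
        continue
    opened = datetime.fromisoformat(pr["created_at"].replace("Z", "+00:00"))
    if now - opened &amp;gt; SLA:
        print(f"#{pr['number']} open {(now - opened).days}d: {pr['title']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;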

&lt;p&gt;Worth reading:&lt;br&gt;
▸ Renovate Bot — automerge strategies, scheduling, and grouping (docs.renovatebot.com)&lt;br&gt;
▸ SLSA framework — supply chain integrity levels and provenance (slsa.dev)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neeraja-portfolio-v1.vercel.app/insights/dependency-updates-are-reliability-work-not-maintenance-work" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/insights/dependency-updates-are-reliability-work-not-maintenance-work&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're earlier in your career: bookmark this. It'll make more sense after your first real production incident.&lt;/p&gt;

&lt;p&gt;#devops #sre #observability #platformengineering&lt;/p&gt;

</description>
      <category>observability</category>
      <category>sre</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>A system design trap I've seen catch strong teams off guard:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Tue, 14 Apr 2026 09:59:40 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/a-system-design-trap-ive-seen-catch-strong-teams-off-guard-3ic4</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/a-system-design-trap-ive-seen-catch-strong-teams-off-guard-3ic4</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-04-14)
&lt;/h1&gt;

&lt;p&gt;A system design trap I've seen catch strong teams off guard:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MLOps retraining in production: the guardrails matter more than the pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Wiring a retraining loop is a weekend project. Making it safe in production — data drift, silent label shifts, rollback semantics — is the actual engineering problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Risky loop:          Safe loop:

Data ──▶ Train       Data ──▶ Version ──▶ Train
  │        │           │                    │
  ▼        ▼           ▼                    ▼
 Prod    Deploy      Validate            Shadow eval
 (no gate)          lineage              │
                                    Pass ──▶ Canary
                                    Fail ──▶ Rollback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Better offline metrics / worse live KPIs — training/serving skew from feature drift you didn't catch.&lt;br&gt;
▸ Unversioned training data makes RCA impossible. You can't reproduce what trained the broken model.&lt;br&gt;
▸ No rollback path means every bad retrain is a production incident with a multi-hour recovery.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ No model promotes without: versioned dataset lineage, shadow/canary evaluation against live traffic, and a tested one-click rollback.&lt;/p&gt;

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ DVC + LakeFS for dataset versioning, MLflow/SageMaker Registry for model promotion gates.&lt;br&gt;
▸ Prometheus + Grafana for drift monitoring — alert on trend, not single-point anomalies.&lt;/p&gt;
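
&lt;p&gt;The promotion decision itself can stay boring. A sketch of the shadow-eval comparison below; the metric names, thresholds, and input shapes are illustrative, and the result would feed whatever registry gate (MLflow, SageMaker, or otherwise) you already use.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: promote a candidate only if shadow evaluation on live traffic holds up
# against the current production model. Names and thresholds are illustrative;
# wire the two dicts to your actual shadow-eval job output.
def should_promote(candidate, production,
                   max_auc_drop=0.005, max_latency_ratio=1.2):
    """candidate/production: dicts like {"auc": 0.91, "p95_ms": 42.0}."""
    if candidate["auc"] &amp;lt; production["auc"] - max_auc_drop:
        return False, "quality regression beyond tolerance"
    if candidate["p95_ms"] &amp;gt; production["p95_ms"] * max_latency_ratio:
        return False, "serving latency regression"
    return True, "holds the SLO slice; promote to canary"

ok, reason = should_promote(
    candidate={"auc": 0.912, "p95_ms": 48.0},
    production={"auc": 0.915, "p95_ms": 41.0},
)
print(ok, reason)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;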

&lt;p&gt;Operational maturity isn't about tools — it's about designing for the failure you haven't seen yet.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/mlops-retraining-in-production-the-guardrails-matter-more-than-the-pipeline" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/mlops-retraining-in-production-the-guardrails-matter-more-than-the-pipeline&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where most runbooks stop — what's your next step after this?&lt;/p&gt;

&lt;p&gt;#mlops #aiops #platformengineering #sre&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Not in any textbook — learned this from a 3am page:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Mon, 13 Apr 2026 01:12:51 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/not-in-any-textbook-learned-this-from-a-3am-page-6ni</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/not-in-any-textbook-learned-this-from-a-3am-page-6ni</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-04-13)
&lt;/h1&gt;

&lt;p&gt;Not in any textbook — learned this from a 3am page:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-call burnout is an alert design problem, not a schedule problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every team I've seen tries to fight burnout by rotating people faster. The actual fix is almost always the same: the alerts are wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alert quality spectrum:

Noisy ◀───────────────────────────▶ Actionable

[cpu &amp;gt; 80%]  [pod restart]  [error budget burn]  [customer impact]
     │              │               │                    │
 ignore me      maybe?         investigate!          wake me up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Alerts without a named owner and a runbook produce paralysis, not action — especially at 2am.&lt;br&gt;
▸ Flapping alerts are the fastest path to alert blindness — engineers learn to dismiss pages before reading them.&lt;br&gt;
▸ Cause-based alerts (disk full) and symptom-based alerts (latency spike) need different urgency and routing.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ Before any alert ships: Who acts on it? What do they do? What's the cost of 30 minutes of inaction? If you can't answer all three, it's not ready.&lt;/p&gt;

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ Weekly alert review ritual: tag every last-week page as actionable / noisy / redundant. Kill the bottom two categories.&lt;br&gt;
▸ PagerDuty/OpsGenie grouping + escalation policies — reduce interrupt rate without hiding real incidents.&lt;/p&gt;
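
&lt;p&gt;The weekly review survives fine in a spreadsheet, but the tally is also a ten-line script. A sketch below, assuming last week's pages export to a CSV with alert_name and tag columns; the file layout is an assumption.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: summarize last week's pages by review tag and surface the kill list.
# Assumes an export like pages.csv with columns: alert_name,tag
# where tag is one of: actionable / noisy / redundant.
import csv
from collections import Counter, defaultdict

totals = Counter()
by_alert = defaultdict(Counter)
with open("pages.csv") as f:
    for row in csv.DictReader(f):
        totals[row["tag"]] += 1
        by_alert[row["alert_name"]][row["tag"]] += 1

print(dict(totals))
for alert, tags in by_alert.items():
    if tags["noisy"] + tags["redundant"] &amp;gt; tags["actionable"]:
        print(f"kill or rewrite: {alert}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;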

&lt;p&gt;Reliability is a product feature. The engineers who treat it that way are the ones who get asked into the room.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/on-call-burnout-is-an-alert-design-problem-not-a-schedule-problem" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/on-call-burnout-is-an-alert-design-problem-not-a-schedule-problem&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hiring managers: engineers who think about this in interviews are the ones worth calling back.&lt;/p&gt;

&lt;p&gt;#sre #observability #reliability #devops&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Something I wish someone had told me five years earlier:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Fri, 10 Apr 2026 09:52:55 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/something-i-wish-someone-had-told-me-five-years-earlier-o1f</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/something-i-wish-someone-had-told-me-five-years-earlier-o1f</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Insight (2026-04-10)
&lt;/h1&gt;

&lt;p&gt;Something I wish someone had told me five years earlier:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distributed tracing: the gap between having it and using it in incidents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most orgs instrument distributed traces correctly and then debug incidents with grep. The investment in tracing pays off only when your debugging workflow changes — when you start from a trace ID instead of a log query. That's a culture change, not a tooling change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Current state (most orgs):      Target state:

Incident fires                  Incident fires
     │                               │
Grep logs ──▶ Guess service     Pull trace ID from alert
     │                               │
More grep ──▶ Find error        Trace shows full request path
     │                               │
Escalate ──▶ More engineers     Latency waterfall identifies
     │                          bottleneck in 3 minutes
MTTR: 90 min                    MTTR: 15 min
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious part:&lt;br&gt;
→ Traces don't reduce MTTR on their own — runbooks that start from trace IDs do. The highest-leverage thing you can do after instrumenting is to rewrite your top 5 incident runbooks to start with 'get the trace ID from the alert, open it in Jaeger/Tempo, find the slowest span.' Engineers follow runbooks under pressure.&lt;/p&gt;

&lt;p&gt;My rule:&lt;br&gt;
→ Instrument your 3 highest-traffic endpoints first. Then rewrite one runbook to start from a trace ID. Measure incident time-to-hypothesis before and after.&lt;/p&gt;
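
&lt;p&gt;Step one of the rewritten runbook can literally be a script. A sketch against Jaeger's HTTP query API (the one its own UI calls) that prints the slowest spans of a trace; the host is a placeholder, and Tempo's API differs slightly.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: start the investigation from a trace ID instead of a log query.
# The Jaeger host is a placeholder for your install.
import requests

JAEGER = "http://jaeger-query.observability:16686"

def slowest_spans(trace_id, top=5):
    trace = requests.get(f"{JAEGER}/api/traces/{trace_id}", timeout=10).json()
    spans = trace["data"][0]["spans"]
    spans.sort(key=lambda s: s["duration"], reverse=True)   # duration is in µs
    for span in spans[:top]:
        print(f'{span["duration"] / 1000:8.1f} ms  {span["operationName"]}')

# trace ID copied straight from the alert payload
slowest_spans("4bf92f3577b34da6a3ce929d0e0e4736")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;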

&lt;p&gt;Worth reading:&lt;br&gt;
▸ OpenTelemetry instrumentation guides — language SDKs (opentelemetry.io/docs)&lt;br&gt;
▸ Grafana Tempo + Loki correlation — trace-to-log workflow without leaving the dashboard&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neeraja-portfolio-v1.vercel.app/insights/distributed-tracing-the-gap-between-having-it-and-using-it-in-incidents" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/insights/distributed-tracing-the-gap-between-having-it-and-using-it-in-incidents&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're earlier in your career: bookmark this. It'll make more sense after your first real production incident.&lt;/p&gt;

&lt;p&gt;#devops #sre #observability #platformengineering&lt;/p&gt;

</description>
      <category>observability</category>
      <category>sre</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>A hard-earned rule from incident retrospectives:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Tue, 07 Apr 2026 09:48:10 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/a-hard-earned-rule-from-incident-retrospectives-40jp</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/a-hard-earned-rule-from-incident-retrospectives-40jp</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-04-07)
&lt;/h1&gt;

&lt;p&gt;A hard-earned rule from incident retrospectives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident RCA without a data-backed timeline is just a story you told yourself&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most post-mortems produce lessons that don't stick. The root cause is almost always the same: the timeline was built from memory, not from data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Memory-based timeline:     Data-backed timeline:

T+0  "Deploy happened"     T+0:00  Deploy (Argo event)
T+?  "Errors started"      T+0:07  Error rate +0.3% (Prometheus)
T+?  "Someone noticed"     T+0:12  P95 latency 340ms→2.1s (trace)
T+?  "We rolled back"      T+0:19  Alert fired (PD)
                           T+0:31  Rollback complete (Argo)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Log timestamps across services diverge by seconds without NTP — your timeline is wrong before you begin.&lt;br&gt;
▸ Correlation between a deploy event and a metric spike gets missed when dashboards lack deployment markers.&lt;br&gt;
▸ Contributing factors vanish from the narrative because they're hard to prove — and the same incident repeats.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ Build the timeline from data only before the RCA meeting begins. If you can't source an event, mark it 'unverified' — not assumed.&lt;/p&gt;

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ OpenTelemetry trace IDs as the timeline spine — they cross service boundaries with sub-millisecond precision.&lt;br&gt;
▸ Grafana annotations on every deploy, config change, and scaling event — visible on every dashboard automatically.&lt;/p&gt;
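
&lt;p&gt;The deploy-marker half is one HTTP call from the deploy job. A sketch against Grafana's annotations API; the host, token variable, and tag names are placeholders.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: drop a deploy marker on dashboards via Grafana's annotations API.
# Host, token env var, and tag names are placeholders for your setup.
import os
import time

import requests

GRAFANA = "https://grafana.internal"

def annotate_deploy(service, version):
    resp = requests.post(
        f"{GRAFANA}/api/annotations",
        headers={"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"},
        json={
            "time": int(time.time() * 1000),        # epoch milliseconds
            "tags": ["deploy", service],
            "text": f"{service} rolled out {version}",
        },
        timeout=10,
    )
    resp.raise_for_status()

annotate_deploy("checkout-api", "v2026.04.07")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;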

&lt;p&gt;Systems that are hard to debug were designed without the debugger in mind. Build observability in, not on.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/incident-rca-without-a-data-backed-timeline-is-just-a-story-you-told-yourself" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/incident-rca-without-a-data-backed-timeline-is-just-a-story-you-told-yourself&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where most runbooks stop — what's your next step after this?&lt;/p&gt;

&lt;p&gt;#sre #reliability #observability #devops&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Something I keep explaining in architecture reviews:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Sun, 05 Apr 2026 16:55:28 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/something-i-keep-explaining-in-architecture-reviews-43n2</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/something-i-keep-explaining-in-architecture-reviews-43n2</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-04-05)
&lt;/h1&gt;

&lt;p&gt;Something I keep explaining in architecture reviews:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secrets management: designing for rotation, not just storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most orgs solve 'where do we store secrets securely.' The teams that get paged at 2am are the ones who never solved 'how do we rotate them without downtime.'&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Storage-only design:        Rotation-aware design:

Secret ──▶ Vault            Secret ──▶ Vault ──▶ Agent Injector
              │                                        │
         Pod (env var)                           Pod (file mount)
              │                                        │
         Restart to           Auto-reload ◀────── Lease renewer
         get new value        (zero downtime)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Secrets as env vars require pod restarts on rotation — making rotation a deployment event with blast radius.&lt;br&gt;
▸ Vault leases expiring in long-running jobs produce auth errors that look like app bugs, not infra failures.&lt;br&gt;
▸ Secret sprawl across namespaces means rotation happens in 12 places — and one always gets missed.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ Design rotation before you design storage. If you can't rotate a secret in under 10 minutes with no downtime, the design isn't production-ready.&lt;/p&gt;

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ Vault Agent Injector or External Secrets Operator — decouple secret delivery from pod lifecycle.&lt;br&gt;
▸ Monthly secret access log audit — stale consumers are how you discover forgotten service accounts before attackers do.&lt;/p&gt;
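
&lt;p&gt;On the consumer side, the pattern that makes rotation boring is 'read from the mounted file, reload on change'. A minimal sketch below, assuming the agent or operator writes the secret to a known path; real services usually hook this into their connection pool instead of polling a global.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: consume a file-mounted secret and pick up rotations without a restart.
# The mount path is whatever your Vault Agent / External Secrets setup writes.
import os
import threading
import time

SECRET_PATH = "/vault/secrets/db-password"

class ReloadingSecret:
    def __init__(self, path, poll_seconds=30):
        self.path = path
        self.value = self._read()
        self._mtime = os.stat(path).st_mtime
        threading.Thread(target=self._watch, args=(poll_seconds,), daemon=True).start()

    def _read(self):
        with open(self.path) as f:
            return f.read().strip()

    def _watch(self, poll_seconds):
        while True:
            time.sleep(poll_seconds)
            mtime = os.stat(self.path).st_mtime
            if mtime != self._mtime:          # rotated: reload, don't restart
                self._mtime = mtime
                self.value = self._read()

password = ReloadingSecret(SECRET_PATH)   # use password.value per connection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;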

&lt;p&gt;Reliability is a product feature. The engineers who treat it that way are the ones who get asked into the room.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/secrets-management-designing-for-rotation-not-just-storage" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/secrets-management-designing-for-rotation-not-just-storage&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this triggered a war story, I'd genuinely love to hear it.&lt;/p&gt;

&lt;p&gt;#security #devops #kubernetes #sre&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Something I wish someone had told me five years earlier:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Fri, 03 Apr 2026 09:39:52 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/something-i-wish-someone-had-told-me-five-years-earlier-4lo7</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/something-i-wish-someone-had-told-me-five-years-earlier-4lo7</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Insight (2026-04-03)
&lt;/h1&gt;

&lt;p&gt;Something I wish someone had told me five years earlier:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Zero-downtime deployments: what 'zero' actually requires most teams don't have&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most teams say they do zero-downtime deploys and mean 'we haven't gotten a complaint in a while.' Actually measuring it tells a different story: connection drops, in-flight request failures, and cache invalidation spikes on every rollout, none of it tracked because nobody defined what 'zero' means.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What 'zero downtime' actually requires:

✓ Health checks reflect REAL readiness (not just 'process started')
✓ Graceful shutdown drains in-flight requests (SIGTERM handling)
✓ Connection draining at the load balancer (not just the pod)
✓ Rollback faster than the deploy (&amp;lt; 5 min, automated)
✓ SLI measurement during the rollout window (not just after)

Missing any one of these = not zero downtime. Just unmonitored downtime.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious part:&lt;br&gt;
→ The most common failure mode is passing health checks before the app is actually ready — DB connections not pooled, caches not warm, background workers not started. The pod is 'Ready' and the app is still initializing. Users see errors. Nobody's dashboard shows it because nobody's measuring error rate during the rollout window.&lt;/p&gt;
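
&lt;p&gt;One concrete shape of the fix: make the readiness endpoint report the work, not the process. A sketch below; the framework, port, path, and warm-up steps are stand-ins for whatever your service actually does at startup.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: "Ready" should mean ready. The readiness endpoint only turns green
# once the expensive startup work has actually finished.
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

ready = threading.Event()

def connect_db_pool():
    time.sleep(2)     # stand-in for opening and testing the real pool

def prime_caches():
    time.sleep(3)     # stand-in for loading hot keys before taking traffic

def warm_up():
    connect_db_pool()
    prime_caches()
    ready.set()       # only now should the rollout count this pod as Ready

class Probe(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/readyz" and ready.is_set():
            self.send_response(200)
        else:
            self.send_response(503)   # still initializing: stay out of rotation
        self.end_headers()

threading.Thread(target=warm_up, daemon=True).start()
HTTPServer(("", 8080), Probe).serve_forever()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;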

&lt;p&gt;My rule:&lt;br&gt;
→ Define 'zero downtime' with a measurable SLI: error rate &amp;lt; 0.1% during any 5-minute deploy window. Validate this in staging before calling it done. Measure it in production on every release.&lt;/p&gt;

&lt;p&gt;Worth reading:&lt;br&gt;
▸ Kubernetes deployment strategies — rolling, blue/green, canary with traffic splitting&lt;br&gt;
▸ AWS ALB / GCP Cloud Load Balancing — connection draining configuration and health check tuning&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neeraja-portfolio-v1.vercel.app/insights/zero-downtime-deployments-what-zero-actually-requires-most-teams-dont-have" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/insights/zero-downtime-deployments-what-zero-actually-requires-most-teams-dont-have&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're a manager reading this — it's worth asking your team where they are on this.&lt;/p&gt;

&lt;p&gt;#devops #sre #observability #platformengineering&lt;/p&gt;

</description>
      <category>observability</category>
      <category>sre</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
  </channel>
</rss>
