<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Neeraja Khanapure</title>
    <description>The latest articles on DEV Community by Neeraja Khanapure (@neeraja_khanapure_4a33a5f).</description>
    <link>https://dev.to/neeraja_khanapure_4a33a5f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3594314%2Feb8f6250-03b3-4528-af8b-17a146fe27c2.png</url>
      <title>DEV Community: Neeraja Khanapure</title>
      <link>https://dev.to/neeraja_khanapure_4a33a5f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/neeraja_khanapure_4a33a5f"/>
    <language>en</language>
    <item>
      <title>A hard-earned rule from incident retrospectives:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Tue, 19 May 2026 11:42:11 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/a-hard-earned-rule-from-incident-retrospectives-1ln6</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/a-hard-earned-rule-from-incident-retrospectives-1ln6</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-05-19)
&lt;/h1&gt;

&lt;p&gt;A hard-earned rule from incident retrospectives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident RCA without a data-backed timeline is just a story you told yourself&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most post-mortems produce lessons that don't stick. The root cause is almost always the same: the timeline was built from memory, not from data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Memory-based timeline:     Data-backed timeline:

T+0  "Deploy happened"     T+0:00  Deploy (Argo event)
T+?  "Errors started"      T+0:07  Error rate +0.3% (Prometheus)
T+?  "Someone noticed"     T+0:12  P95 latency 340ms→2.1s (trace)
T+?  "We rolled back"      T+0:19  Alert fired (PD)
                           T+0:31  Rollback complete (Argo)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Log timestamps across services diverge by seconds without NTP — your timeline is wrong before you begin.&lt;br&gt;
▸ Correlation between a deploy event and a metric spike gets missed when dashboards lack deployment markers.&lt;br&gt;
▸ Contributing factors vanish from the narrative because they're hard to prove — and the same incident repeats.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ Build the timeline from data alone, and finish it before the RCA meeting begins. If you can't source an event, mark it 'unverified' rather than assumed.&lt;/p&gt;
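
&lt;p&gt;A minimal sketch of that rule, assuming every sourced event carries a timezone-aware timestamp. The &lt;code&gt;Event&lt;/code&gt; type and &lt;code&gt;build_timeline&lt;/code&gt; helper are hypothetical names for illustration, not part of any tool mentioned here:&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Event:
    description: str
    timestamp: Optional[datetime]  # None when nobody can source the time
    source: Optional[str]          # e.g. "argo", "prometheus"; None if from memory


def build_timeline(events: list[Event]) -> list[str]:
    """Sort sourced events by time; list unsourced ones last as 'unverified'."""
    verified = [e for e in events if e.timestamp is not None and e.source]
    unverified = [e for e in events if e.timestamp is None or not e.source]
    verified.sort(key=lambda e: e.timestamp)
    lines = []
    if verified:
        t0 = verified[0].timestamp
        for e in verified:
            minutes = int((e.timestamp - t0).total_seconds() // 60)
            lines.append(f"T+{minutes // 60}:{minutes % 60:02d}  {e.description} ({e.source})")
    # Events without a data source are kept, but never given a made-up time.
    lines += [f"unverified  {e.description}" for e in unverified]
    return lines
```

&lt;p&gt;Anything remembered but not sourced stays visible in the output, so the meeting can chase evidence for it instead of arguing about when it happened.&lt;/p&gt;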

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ OpenTelemetry trace IDs as the timeline spine: they propagate across service boundaries, and span parent/child links preserve causal order even when host clocks disagree.&lt;br&gt;
▸ Grafana annotations on every deploy, config change, and scaling event — visible on every dashboard automatically.&lt;/p&gt;

&lt;p&gt;Systems that are hard to debug were designed without the debugger in mind. Build observability in, not on.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/incident-rca-without-a-data-backed-timeline-is-just-a-story-you-told-yourself" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/incident-rca-without-a-data-backed-timeline-is-just-a-story-you-told-yourself&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where most runbooks stop — what's your next step after this?&lt;/p&gt;

&lt;p&gt;#sre #reliability #observability #devops&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Something I wish someone had told me five years earlier:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Fri, 15 May 2026 11:06:41 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/something-i-wish-someone-had-told-me-five-years-earlier-6do</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/something-i-wish-someone-had-told-me-five-years-earlier-6do</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Insight (2026-05-15)
&lt;/h1&gt;

&lt;p&gt;Something I wish someone had told me five years earlier:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AIOps is a reasoning accelerator, not an auto-remediation system&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The orgs getting real value from AIOps aren't the ones automating remediation — they're the ones using AI to compress the signal-to-hypothesis gap. The hard part of incidents isn't fixing things. It's knowing what to fix, in what order, with what confidence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Where AI adds real value in incidents:

Alert storm (200 events)
       │
       ▼
  [AI correlation]  ──▶  3 likely root causes (ranked)
       │
       ▼
  [AI runbook retrieval]  ──▶  Relevant steps surfaced
       │
       ▼
  Human validates hypothesis  ──▶  Takes action
       │
  Auto-remediation only here ──▶  After human confirms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious part:&lt;br&gt;
→ AI that remediates without evidence is hallucination-as-a-service. The models that earn trust are the ones that show their work: here's the metric spike, here's the correlated trace, here's the similar past incident. Evidence first, action second.&lt;/p&gt;

&lt;p&gt;My rule:&lt;br&gt;
→ Use AI for hypothesis ranking and runbook retrieval. Keep remediation behind explicit human approval. Trust is earned incrementally — don't give it away in the initial design.&lt;/p&gt;
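
&lt;p&gt;One way to sketch that split in code. The &lt;code&gt;Hypothesis&lt;/code&gt; type, the score field, and both functions are hypothetical; a real correlation engine would supply the scores and evidence:&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Hypothesis:
    cause: str
    score: float                                  # assumed output of a correlation step
    evidence: list[str] = field(default_factory=list)


def rank(hypotheses: list[Hypothesis], top_n: int = 3) -> list[Hypothesis]:
    """Drop hypotheses with no evidence, then return the top_n by score."""
    backed = [h for h in hypotheses if h.evidence]
    return sorted(backed, key=lambda h: h.score, reverse=True)[:top_n]


def remediate(hypothesis: Hypothesis, approved_by: Optional[str]) -> str:
    """Refuse to act unless a named human has approved this hypothesis."""
    if not approved_by:
        return f"BLOCKED: '{hypothesis.cause}' awaits human approval"
    return f"RUN: remediation for '{hypothesis.cause}' (approved by {approved_by})"
```

&lt;p&gt;Note that a high score with no evidence never reaches the ranked list, which is the "evidence first, action second" ordering expressed as a filter.&lt;/p&gt;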

&lt;p&gt;Worth reading:&lt;br&gt;
▸ OpenTelemetry — consistent signal foundation for AI correlation (opentelemetry.io)&lt;br&gt;
▸ Blameless RCA templates — 'did AI help or mislead?' as a standard post-incident question&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neeraja-portfolio-v1.vercel.app/insights/aiops-is-a-reasoning-accelerator-not-an-auto-remediation-system" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/insights/aiops-is-a-reasoning-accelerator-not-an-auto-remediation-system&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you disagree, I want to hear it. The best version of this thinking comes from pushback.&lt;/p&gt;

&lt;p&gt;#devops #sre #observability #platformengineering&lt;/p&gt;

</description>
      <category>observability</category>
      <category>sre</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>This pattern has saved production twice in the last year:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Tue, 12 May 2026 11:02:14 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/this-pattern-has-saved-production-twice-in-the-last-year-33m6</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/this-pattern-has-saved-production-twice-in-the-last-year-33m6</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-05-12)
&lt;/h1&gt;

&lt;p&gt;This pattern has saved production twice in the last year:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service mesh adoption: the operational debt lands before the value does&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Service meshes promise mTLS, traffic splitting, and deep observability. What arrives first is a new category of production failures your team has never debugged before.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Adoption curve reality:

Value
  │                              ╱ mTLS + traffic control
  │                         ╱
  │              ╱╲  complexity trough
  │         ╱╲╱
  │    ╱╲╱   ← sidecar failures, upgrade pain
  │╱
  └──────────────────────────────▶ Time
     Week 1     Month 3     Month 9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Sidecar injection failures look like app bugs — hours spent debugging the wrong layer.&lt;br&gt;
▸ mTLS policy rollout in a live cluster requires namespace-by-namespace phasing — one mistake stops traffic.&lt;br&gt;
▸ Mesh upgrades require coordinated sidecar restarts across the cluster. On large deployments, that means restarting every meshed workload.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ Start mesh in observability-only mode (no policy enforcement). Prove value in one namespace first. Earn the rollout, don't mandate it.&lt;/p&gt;

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ Linkerd for latency-sensitive workloads — lower resource overhead than Istio's Envoy per sidecar.&lt;br&gt;
▸ Namespace-level feature flags for mesh policy — lets you roll back one team without affecting others.&lt;/p&gt;

&lt;p&gt;The difference between a senior engineer and a principal is knowing which guardrails to build before you need them.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/service-mesh-adoption-the-operational-debt-lands-before-the-value-does" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/service-mesh-adoption-the-operational-debt-lands-before-the-value-does&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this triggered a war story, I'd genuinely love to hear it.&lt;/p&gt;

&lt;p&gt;#kubernetes #devops #sre #platformengineering&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>One insight that changed how I design systems:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Fri, 08 May 2026 10:09:44 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/one-insight-that-changed-how-i-design-systems-4pk5</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/one-insight-that-changed-how-i-design-systems-4pk5</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Insight (2026-05-08)
&lt;/h1&gt;

&lt;p&gt;One insight that changed how I design systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature flags are ops infrastructure, not a product team tool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most platform teams treat feature flags as a product or A/B-testing concern and miss the most operationally valuable use case: instant rollback for any backend change, without a redeployment. The teams with the lowest incident MTTR almost all have the same secret: they can disable any code path in 30 seconds.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Feature flag as ops tool:

New backend change ships ──▶ [flag: new_auth_flow = true]
                                        │
                              Incident detected (T+12min)
                                        │
                              [flag: new_auth_flow = false]
                                        │
                              Recovery complete (T+13min)

vs. traditional rollback:    Recovery complete (T+75min)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious part:&lt;br&gt;
→ A feature flag that can disable a code path in 30 seconds is worth more than any runbook during an active incident. It converts a deployment rollback — with its pipeline, merge, and propagation delays — into a config change. Most platform teams own the infra for this and never tell product teams it exists.&lt;/p&gt;

&lt;p&gt;My rule:&lt;br&gt;
→ Every significant backend change ships behind a flag for at least 2 weeks post-deploy. Flags are cheap. Incidents are not. Make flags a deployment requirement, not an option.&lt;/p&gt;
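
&lt;p&gt;A rough sketch of a flag used as a kill switch. &lt;code&gt;FlagStore&lt;/code&gt; is a hypothetical in-memory stand-in for whatever flag service you run; the flag name &lt;code&gt;new_auth_flow&lt;/code&gt; comes from the diagram above:&lt;/p&gt;

```python
class FlagStore:
    """In-memory stand-in for a flag service; unknown flags default to off."""

    def __init__(self) -> None:
        self._flags: dict[str, bool] = {}

    def set(self, name: str, enabled: bool) -> None:
        self._flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)


def authenticate(user: str, flags: FlagStore) -> str:
    # The flag is read on every call, so flipping it takes effect on the
    # next request, with no redeploy.
    if flags.is_enabled("new_auth_flow"):
        return f"new-flow:{user}"
    return f"legacy-flow:{user}"
```

&lt;p&gt;Defaulting unknown flags to off is a deliberate choice here: if the flag service loses state, traffic falls back to the old, proven path.&lt;/p&gt;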

&lt;p&gt;Worth reading:&lt;br&gt;
▸ Martin Fowler — Feature Toggles pattern (martinfowler.com/articles/feature-toggles.html)&lt;br&gt;
▸ LaunchDarkly / Unleash — flag lifecycle management and audit logging best practices&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neeraja-portfolio-v1.vercel.app/insights/feature-flags-are-ops-infrastructure-not-a-product-team-tool" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/insights/feature-flags-are-ops-infrastructure-not-a-product-team-tool&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this resonated, share it with one person on your team who'd benefit. That's how good thinking spreads.&lt;/p&gt;

&lt;p&gt;#devops #sre #observability #platformengineering&lt;/p&gt;

</description>
      <category>observability</category>
      <category>sre</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>Not in any textbook — learned this from a 3am page:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Thu, 07 May 2026 02:44:22 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/not-in-any-textbook-learned-this-from-a-3am-page-43j9</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/not-in-any-textbook-learned-this-from-a-3am-page-43j9</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-05-07)
&lt;/h1&gt;

&lt;p&gt;Not in any textbook — learned this from a 3am page:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-call burnout is an alert design problem, not a schedule problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every team I've seen fights burnout by rotating people faster. The actual problem is almost always the same: the alerts are wrong.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Alert quality spectrum:

Noisy ◀─────────────────────────── ▶ Actionable

[cpu &amp;gt; 80%]  [pod restart]  [error budget burn]  [customer impact]
     │              │               │                    │
 ignore me      maybe?         investigate!          wake me up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Alerts without a named owner and a runbook produce paralysis, not action — especially at 2am.&lt;br&gt;
▸ Flapping alerts are the fastest path to alert blindness — engineers learn to dismiss pages before reading them.&lt;br&gt;
▸ Cause-based alerts (disk full) and symptom-based alerts (latency spike) need different urgency and routing.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ Before any alert ships: Who acts on it? What do they do? What's the cost of 30 minutes of inaction? If you can't answer all three, it's not ready.&lt;/p&gt;
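
&lt;p&gt;The three questions can be turned into a pre-merge check. This is only a sketch; &lt;code&gt;AlertSpec&lt;/code&gt; and its field names are hypothetical and would map onto however your alert definitions are stored:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertSpec:
    name: str
    owner: Optional[str] = None                   # who acts on it
    runbook_url: Optional[str] = None             # what they do
    cost_of_30min_inaction: Optional[str] = None  # why it is worth a page


def readiness_gaps(alert: AlertSpec) -> list[str]:
    """Return the unanswered questions; an empty list means the alert may ship."""
    gaps = []
    if not alert.owner:
        gaps.append("Who acts on it?")
    if not alert.runbook_url:
        gaps.append("What do they do?")
    if not alert.cost_of_30min_inaction:
        gaps.append("What's the cost of 30 minutes of inaction?")
    return gaps
```

&lt;p&gt;Run in CI, a non-empty result would block the alert from shipping until all three answers exist.&lt;/p&gt;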

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ Weekly alert review ritual: tag every page from the past week as actionable / noisy / redundant. Kill the last two categories.&lt;br&gt;
▸ PagerDuty/OpsGenie grouping + escalation policies — reduce interrupt rate without hiding real incidents.&lt;/p&gt;

&lt;p&gt;Reliability is a product feature. The engineers who treat it that way are the ones who get asked into the room.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/on-call-burnout-is-an-alert-design-problem-not-a-schedule-problem" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/on-call-burnout-is-an-alert-design-problem-not-a-schedule-problem&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hiring managers: engineers who think about this in interviews are the ones worth calling back.&lt;/p&gt;

&lt;p&gt;#sre #observability #reliability #devops&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Something I keep explaining in architecture reviews:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Tue, 05 May 2026 10:23:50 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/something-i-keep-explaining-in-architecture-reviews-3644</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/something-i-keep-explaining-in-architecture-reviews-3644</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-05-05)
&lt;/h1&gt;

&lt;p&gt;Something I keep explaining in architecture reviews:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Secrets management: designing for rotation, not just storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most orgs solve 'where do we store secrets securely.' The teams that get paged at 2am are the ones who never solved 'how do we rotate them without downtime.'&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Storage-only design:        Rotation-aware design:

Secret ──▶ Vault            Secret ──▶ Vault ──▶ Agent Injector
              │                                        │
         Pod (env var)                           Pod (file mount)
              │                                        │
         Restart to           Auto-reload ◀────── Lease renewer
         get new value        (zero downtime)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Secrets as env vars require pod restarts on rotation — making rotation a deployment event with blast radius.&lt;br&gt;
▸ Vault leases expiring in long-running jobs produce auth errors that look like app bugs, not infra failures.&lt;br&gt;
▸ Secret sprawl across namespaces means rotation happens in 12 places — and one always gets missed.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ Design rotation before you design storage. If you can't rotate a secret in under 10 minutes with no downtime, the design isn't production-ready.&lt;/p&gt;
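
&lt;p&gt;On the application side, rotation without restarts can be as small as re-reading a mounted file when it changes. A minimal sketch, assuming the secret is delivered as a file; the &lt;code&gt;FileSecret&lt;/code&gt; class is a hypothetical helper, not a Vault or Kubernetes API:&lt;/p&gt;

```python
import os


class FileSecret:
    """Serve a secret from a mounted file, re-reading it when the file changes."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._mtime_ns = -1
        self._value = ""

    def get(self) -> str:
        mtime_ns = os.stat(self._path).st_mtime_ns
        if mtime_ns != self._mtime_ns:
            # The file was rewritten since the last read: pick up the rotation.
            with open(self._path, encoding="utf-8") as f:
                self._value = f.read().strip()
            self._mtime_ns = mtime_ns
        return self._value
```

&lt;p&gt;Callers ask for the secret each time they need it instead of caching it at startup, which is what removes the pod restart from the rotation path.&lt;/p&gt;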

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ Vault Agent Injector or External Secrets Operator — decouple secret delivery from pod lifecycle.&lt;br&gt;
▸ Monthly secret access log audit — stale consumers are how you discover forgotten service accounts before attackers do.&lt;/p&gt;

&lt;p&gt;Reliability is a product feature. The engineers who treat it that way are the ones who get asked into the room.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/secrets-management-designing-for-rotation-not-just-storage" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/secrets-management-designing-for-rotation-not-just-storage&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this triggered a war story, I'd genuinely love to hear it.&lt;/p&gt;

&lt;p&gt;#security #devops #kubernetes #sre&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>This is what separates teams that scale from teams that survive:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Fri, 01 May 2026 10:11:17 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/this-is-what-separates-teams-that-scale-from-teams-that-survive-5b0o</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/this-is-what-separates-teams-that-scale-from-teams-that-survive-5b0o</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Insight (2026-05-01)
&lt;/h1&gt;

&lt;p&gt;This is what separates teams that scale from teams that survive:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capacity planning is a risk budget conversation, not a utilization spreadsheet&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Teams that plan capacity by extrapolating last month's P95 get surprised when a product launch doubles traffic in a week. The right frame isn't 'what utilization should we run at' — it's 'what's the asymmetric cost of being wrong in each direction, and how much buffer does that justify?'&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost asymmetry analysis:

Over-provision by 20%:    Under-provision by 20%:

Cost: +$8K/month          Cost: Incident
                               + on-call burnout
Direct, predictable            + customer churn
                               + post-mortem
                               + team morale tax

For P0 services, 20% buffer almost always wins.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious part:&lt;br&gt;
→ The teams who get capacity planning right treat it as an insurance calculation, not an optimization problem. Over-provision cost is direct and visible. Under-provision cost is diffuse, delayed, and always larger than it looks. The asymmetry should drive your buffer strategy — not your CFO's target utilization number.&lt;/p&gt;

&lt;p&gt;My rule:&lt;br&gt;
→ Set buffer based on incident cost, not utilization targets. For every P0 service, calculate: what does one hour of downtime cost vs one month of 20% over-provision? The math almost always justifies the buffer.&lt;/p&gt;
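
&lt;p&gt;The comparison is simple arithmetic. A sketch with hypothetical function names; the downtime figures you feed it are your own estimates, not numbers from this post:&lt;/p&gt;

```python
def break_even_hours(overprovision_cost_per_month: float, downtime_cost_per_hour: float) -> float:
    """Hours of downtime per month at which the buffer pays for itself."""
    return overprovision_cost_per_month / downtime_cost_per_hour


def buffer_is_justified(
    overprovision_cost_per_month: float,
    downtime_cost_per_hour: float,
    expected_downtime_hours_avoided_per_month: float,
) -> bool:
    """True when the expected downtime cost avoided covers the buffer's cost."""
    avoided = downtime_cost_per_hour * expected_downtime_hours_avoided_per_month
    return avoided >= overprovision_cost_per_month
```

&lt;p&gt;For example, with the $8K/month buffer from the diagram and an assumed downtime cost of $40K/hour, &lt;code&gt;break_even_hours(8000, 40000)&lt;/code&gt; is 0.2: the buffer pays for itself if it prevents 12 minutes of downtime a month.&lt;/p&gt;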

&lt;p&gt;Worth reading:&lt;br&gt;
▸ Google SRE Book: Being On-Call (ch. 11) and Handling Overload (ch. 21)&lt;br&gt;
▸ AWS/GCP cost anomaly detection — real-time signals for when your buffer is being consumed&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neeraja-portfolio-v1.vercel.app/insights/capacity-planning-is-a-risk-budget-conversation-not-a-utilization-spreadsheet" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/insights/capacity-planning-is-a-risk-budget-conversation-not-a-utilization-spreadsheet&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hiring note: engineers who think about this in system design conversations stand out immediately.&lt;/p&gt;

&lt;p&gt;#devops #sre #observability #platformengineering&lt;/p&gt;

</description>
      <category>observability</category>
      <category>sre</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>Not in any textbook — learned this from a 3am page:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Tue, 28 Apr 2026 10:34:18 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/not-in-any-textbook-learned-this-from-a-3am-page-5edn</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/not-in-any-textbook-learned-this-from-a-3am-page-5edn</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-04-28)
&lt;/h1&gt;

&lt;p&gt;Not in any textbook — learned this from a 3am page:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes rollouts: why 'pods are Ready' is the wrong promotion gate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Readiness is a pod-local signal. Production health is a global one. Most rollout pipelines conflate the two, and that's where incidents come from.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Bad gate:                         Good gate:

Deploy ──▶ Pods Ready? ──▶ Done   Deploy ──▶ Pods Ready?
           (local signal)                    │
                                             ▼
                                    SLO window check
                                    (error rate + p95)
                                             │
                                    Pass ──▶ Promote
                                    Fail ──▶ Auto-rollback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ 100% Ready pods while P95 latency spikes — bad cache warmup, noisy neighbor, DB connection saturation.&lt;br&gt;
▸ HPA reacts slower than a fast rollout — you ship overload before autoscaling catches up.&lt;br&gt;
▸ Canary stuck green because metrics lack the right labels/slices to isolate the failing segment.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ Promote only when the canary holds your SLO slice (error rate + latency) for a fixed observation window. Otherwise: auto-rollback.&lt;/p&gt;
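
&lt;p&gt;The gate logic itself is small. Below is a sketch of the decision only, not how Argo Rollouts or Flagger implement it; &lt;code&gt;Sample&lt;/code&gt; and &lt;code&gt;promotion_decision&lt;/code&gt; are hypothetical names, and the samples are assumed to come from your metrics backend at a fixed interval:&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    error_rate: float       # fraction of failed requests, e.g. 0.002
    p95_latency_ms: float


def promotion_decision(
    window: list[Sample],
    required_samples: int,
    max_error_rate: float,
    max_p95_latency_ms: float,
) -> str:
    """Return 'promote', 'rollback', or 'wait' for a canary observation window."""
    for s in window:
        if s.error_rate > max_error_rate or s.p95_latency_ms > max_p95_latency_ms:
            return "rollback"   # any SLO breach inside the window fails the gate
    if len(window) >= required_samples:
        return "promote"        # the SLO held for the whole window
    return "wait"               # not enough evidence yet
```

&lt;p&gt;Pod readiness never appears in the inputs: it only decides when sampling may start, not whether the rollout is healthy.&lt;/p&gt;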

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ Argo Rollouts or Flagger with Prometheus gates — error rate, latency percentiles, saturation.&lt;br&gt;
▸ Alert on canary-vs-baseline deltas, not absolute thresholds. Catches regressions that pass absolute checks.&lt;/p&gt;

&lt;p&gt;Operational maturity isn't about tools — it's about designing for the failure you haven't seen yet.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/kubernetes-rollouts-why-pods-are-ready-is-the-wrong-promotion-gate" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/kubernetes-rollouts-why-pods-are-ready-is-the-wrong-promotion-gate&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Strong opinions on this? Good. I want to hear the pushback.&lt;/p&gt;

&lt;p&gt;#kubernetes #reliability #devops #sre&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>A Friday systems thinking thread — worth sitting with:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Fri, 24 Apr 2026 10:12:05 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/a-friday-systems-thinking-thread-worth-sitting-with-4fbl</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/a-friday-systems-thinking-thread-worth-sitting-with-4fbl</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Insight (2026-04-24)
&lt;/h1&gt;

&lt;p&gt;A Friday systems thinking thread — worth sitting with:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability debt is invisible until an incident makes it expensive&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can't debug what you can't slice. Teams add dashboards for years and still can't answer the two questions that matter most in an incident: which customers are affected, and which change caused it. The problem is almost never the tool — it's the label strategy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Observability debt accumulation:

Month 1:  Service A metrics added (no ownership labels)
Month 3:  Service B metrics added (different label schema)
Month 6:  Dashboard count: 47. Useful in incident: 3.
Month 9:  P0 incident. Can't isolate by customer/version.
          Engineer guesses. Guesses wrong. +45min MTTR.

Fix: Define label schema FIRST. Instrument second.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The non-obvious part:&lt;br&gt;
→ The teams that debug incidents fastest don't have more metrics — they have metrics that answer the right questions at the right cardinality. SLI-first instrumentation design is a force multiplier. Most teams instrument first and wonder why dashboards are noisy.&lt;/p&gt;

&lt;p&gt;My rule:&lt;br&gt;
→ Define your SLIs, then design labels that let you isolate by (service, env, version, customer tier) without exploding cardinality. Instrument last.&lt;/p&gt;
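
&lt;p&gt;A label schema can be checked mechanically before any metric ships. In this sketch the schema contents and helper names are hypothetical; the four label names are the ones from the rule above:&lt;/p&gt;

```python
from math import prod

# Hypothetical schema: allowed label names and the values each may take.
LABEL_SCHEMA: dict[str, set[str]] = {
    "service": {"checkout", "search"},
    "env": {"prod", "staging"},
    "version": {"v41", "v42"},
    "customer_tier": {"free", "pro", "enterprise"},
}


def validate_labels(labels: dict[str, str]) -> list[str]:
    """Return schema violations for one metric's label set (empty means valid)."""
    problems = [f"missing label: {k}" for k in LABEL_SCHEMA if k not in labels]
    for key, value in labels.items():
        if key not in LABEL_SCHEMA:
            # Unbounded labels such as a user ID are what explode cardinality.
            problems.append(f"unknown label: {key}")
        elif value not in LABEL_SCHEMA[key]:
            problems.append(f"unexpected value for {key}: {value}")
    return problems


def max_series_per_metric() -> int:
    """Upper bound on time series per metric: the product of label value counts."""
    return prod(len(values) for values in LABEL_SCHEMA.values())
```

&lt;p&gt;With this schema the bound is 2 x 2 x 2 x 3 = 24 series per metric, a number you can reason about before instrumenting anything.&lt;/p&gt;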

&lt;p&gt;Worth reading:&lt;br&gt;
▸ Brendan Gregg's USE Method + Tom Wilkie's RED Method for SLI-first design&lt;br&gt;
▸ Prometheus label best practices — cardinality anti-patterns (prometheus.io/docs)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neeraja-portfolio-v1.vercel.app/insights/observability-debt-is-invisible-until-an-incident-makes-it-expensive" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/insights/observability-debt-is-invisible-until-an-incident-makes-it-expensive&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What's the version of this that your org gets wrong? Drop it below.&lt;/p&gt;

&lt;p&gt;#devops #sre #observability #platformengineering&lt;/p&gt;

</description>
      <category>observability</category>
      <category>sre</category>
      <category>devops</category>
      <category>platformengineering</category>
    </item>
    <item>
      <title>A hard-earned rule from incident retrospectives:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Tue, 21 Apr 2026 10:05:01 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/a-hard-earned-rule-from-incident-retrospectives-1e59</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/a-hard-earned-rule-from-incident-retrospectives-1e59</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-04-21)
&lt;/h1&gt;

&lt;p&gt;A hard-earned rule from incident retrospectives:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitOps drift: the silent accumulation that makes clusters unmanageable&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;GitOps promises Git as the source of truth. The reality: every manual &lt;code&gt;kubectl&lt;/code&gt; during an incident is a lie you told your cluster and forgot to retract.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GitOps truth gap over time:

Week 1:  Git ══════════ Cluster  (clean)
Week 4:  Git ══════╌╌╌╌ Cluster  (2 manual patches)
Week 12: Git ════╌╌╌╌╌╌╌╌╌╌╌╌╌  (drift accumulates)
                         Cluster  (unknown state)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
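▸ Manual patches during incidents create cluster state Git doesn't know about. With auto-sync and self-heal enabled, Argo CD or Flux reverts them on the next reconcile without warning; without it, they linger as untracked drift.&lt;br&gt;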
▸ Secrets managed outside GitOps (sealed-secrets, Vault agent) drift independently — invisible in sync status.&lt;br&gt;
▸ Multi-cluster setups multiply drift: each cluster diverges at its own pace once human intervention happens.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ Treat every manual cluster change as a 5-minute loan. Commit it back to Git before the incident closes — or it's gone.&lt;/p&gt;

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ Argo CD drift detection dashboard — surface out-of-sync resources before they become incident contributors.&lt;br&gt;
▸ Weekly diff job: live cluster state vs Git. Opens a PR for anything untracked. Makes drift visible before it's painful.&lt;/p&gt;
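
&lt;p&gt;The core of such a diff job might look like the sketch below. It assumes both sides have already been exported into dictionaries keyed by &lt;code&gt;kind/namespace/name&lt;/code&gt;; the &lt;code&gt;find_drift&lt;/code&gt; function and that key format are hypothetical:&lt;/p&gt;

```python
def find_drift(git_state: dict[str, dict], live_state: dict[str, dict]) -> dict[str, list[str]]:
    """Compare resources keyed by 'kind/namespace/name' and classify the drift."""
    return {
        # In the cluster but not in Git: likely created by hand during an incident.
        "untracked": sorted(live_state.keys() - git_state.keys()),
        # In Git but absent from the cluster: deleted by hand or never synced.
        "missing": sorted(git_state.keys() - live_state.keys()),
        # Present in both, but the specs differ: a manual patch.
        "modified": sorted(
            k for k in git_state if k in live_state and git_state[k] != live_state[k]
        ),
    }
```

&lt;p&gt;Each non-empty bucket becomes a line item in the PR the job opens, so someone has to either commit the change or revert it.&lt;/p&gt;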

&lt;p&gt;The best platform teams I've seen measure success by how rarely product teams have to think about infrastructure.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/gitops-drift-the-silent-accumulation-that-makes-clusters-unmanageable" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/gitops-drift-the-silent-accumulation-that-makes-clusters-unmanageable&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Curious what guardrails you've built around this. Drop your pattern below.&lt;/p&gt;

&lt;p&gt;#gitops #kubernetes #devops #platformengineering&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Something every senior engineer learns the expensive way:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Sun, 19 Apr 2026 16:53:12 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/something-every-senior-engineer-learns-the-expensive-way-41cd</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/something-every-senior-engineer-learns-the-expensive-way-41cd</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-04-19)
&lt;/h1&gt;

&lt;p&gt;Something every senior engineer learns the expensive way:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terraform DAGs at scale: when the graph becomes the hazard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Terraform's dependency graph is elegant at small scale. At 500+ resources across a mono-repo, it becomes the most dangerous part of your infrastructure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SAFE (small module):              DANGEROUS (at scale):

[vpc] ──▶ [subnet] ──▶ [ec2]     [shared-net] ──┬──▶ [team-a-infra]
                                                ├──▶ [team-b-infra]
                                                ├──▶ [team-c-infra]
                                                └──▶ [data-layer]
                                  One change → fan-out destroy/create
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Implicit ordering assumptions survive until a refactor exposes them — usually as an unplanned destroy chain in prod.&lt;br&gt;
▸ Fan-out graphs make blast radius review near-impossible. 'What does this change affect?' has no fast answer.&lt;br&gt;
▸ &lt;code&gt;depends_on&lt;/code&gt; papers over bad module interfaces: it fixes the ordering symptom but couples the modules permanently.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ If a module needs &lt;code&gt;depends_on&lt;/code&gt; to be safe, the module boundary is wrong. Redesign the interface — don't paper over it.&lt;/p&gt;

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ &lt;code&gt;terraform graph | dot -Tsvg &amp;gt; graph.svg&lt;/code&gt; — visualize fan-out and cycles before every major refactor.&lt;br&gt;
▸ Gate every apply with OPA/Conftest policy checks against the plan, plus mandatory human review of any planned destroy.&lt;/p&gt;
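
&lt;p&gt;A rough sketch of the destroy gate, written in Python rather than as an OPA/Conftest policy purely for illustration. It assumes the plan has been exported with &lt;code&gt;terraform show -json&lt;/code&gt; and relies on the &lt;code&gt;resource_changes[].change.actions&lt;/code&gt; field of that JSON; the function name &lt;code&gt;planned_destroys&lt;/code&gt; and the sample plan are hypothetical:&lt;/p&gt;

```python
import json


def planned_destroys(plan_json: str) -> list[str]:
    """Return addresses of resources the plan would delete or replace."""
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement appears as ["delete", "create"] or ["create", "delete"],
        # so checking for "delete" covers both plain destroys and replacements.
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Hypothetical, hand-written plan fragment for illustration only.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "aws_vpc.shared", "change": {"actions": ["update"]}},
        {"address": "aws_subnet.team_a", "change": {"actions": ["delete", "create"]}},
        {"address": "aws_instance.team_b", "change": {"actions": ["delete"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
    ]
})

if __name__ == "__main__":
    for address in planned_destroys(SAMPLE_PLAN):
        print("needs human review:", address)
```

&lt;p&gt;In CI, a non-empty result would block the apply until a reviewer signs off.&lt;/p&gt;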

&lt;p&gt;The difference between a senior engineer and a principal is knowing which guardrails to build before you need them.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/terraform-dags-at-scale-when-the-graph-becomes-the-hazard" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/terraform-dags-at-scale-when-the-graph-becomes-the-hazard&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Curious what guardrails you've built around this. Drop your pattern below.&lt;/p&gt;

&lt;p&gt;#terraform #iac #devops #sre&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Not in any textbook — learned this from a 3am page:</title>
      <dc:creator>Neeraja Khanapure</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:03:01 +0000</pubDate>
      <link>https://dev.to/neeraja_khanapure_4a33a5f/not-in-any-textbook-learned-this-from-a-3am-page-ipj</link>
      <guid>https://dev.to/neeraja_khanapure_4a33a5f/not-in-any-textbook-learned-this-from-a-3am-page-ipj</guid>
      <description>&lt;h1&gt;
  
  
  LinkedIn Draft — Workflow (2026-04-18)
&lt;/h1&gt;

&lt;p&gt;Not in any textbook — learned this from a 3am page:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes cost spikes: the usual suspects and how to find them fast&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes cloud bills spike for the same handful of reasons every time. None of them show up in the default dashboards, and most stay hidden until the month-end invoice.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cost leak sources (ranked by surprise factor):

1. Unset resource requests   → scheduler packs nodes → OOM → over-provision
2. Autoscaler scale-down lag → zombie nodes after traffic spike
3. Unsampled log pipelines   → 40% of bill, 0% of dashboards
4. Idle namespaces           → dev clusters running 24/7
5. Spot interruption gaps    → fallback to on-demand, never reverted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where it breaks:&lt;br&gt;
▸ Missing resource requests let the scheduler over-pack nodes — when pods OOM, you over-provision to compensate.&lt;br&gt;
▸ Cluster autoscaler adds nodes faster than it removes them, so capacity from a traffic spike lingers as zombie nodes for hours. Spot interruptions trigger a fallback to on-demand that never gets reverted.&lt;br&gt;
▸ Logging agents (Fluentd/Filebeat) on every node with no sampling become the largest line item nobody owns.&lt;/p&gt;

&lt;p&gt;The rule I keep coming back to:&lt;br&gt;
→ Every workload needs requests AND limits. Review autoscaler scale-down thresholds monthly. Sample logs at source, not at the sink.&lt;/p&gt;
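
&lt;p&gt;One way to audit the requests-and-limits rule: a minimal sketch that walks a manifest already parsed into a Python dict and reports containers missing either block. The function name &lt;code&gt;missing_resource_fields&lt;/code&gt; and the sample Deployment are hypothetical. It only handles Pods and controllers that nest the pod spec under &lt;code&gt;spec.template.spec&lt;/code&gt;, so CronJobs and init containers would need extra handling:&lt;/p&gt;

```python
def missing_resource_fields(manifest: dict) -> dict[str, list[str]]:
    """Map container name to the resource blocks ('requests', 'limits') it lacks."""
    spec = manifest.get("spec", {})
    # Deployments, StatefulSets and DaemonSets nest the pod spec under
    # spec.template.spec; a bare Pod keeps containers directly under spec.
    pod_spec = spec.get("template", {}).get("spec", spec)
    report = {}
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        missing = [field for field in ("requests", "limits") if not resources.get(field)]
        if missing:
            report[container["name"]] = missing
    return report


# Hand-written sample manifest for illustration only.
SAMPLE_DEPLOYMENT = {
    "kind": "Deployment",
    "metadata": {"name": "checkout"},
    "spec": {"template": {"spec": {"containers": [
        {"name": "api", "resources": {
            "requests": {"cpu": "250m", "memory": "256Mi"},
            "limits": {"cpu": "500m", "memory": "512Mi"},
        }},
        {"name": "sidecar", "resources": {"requests": {"cpu": "50m"}}},
        {"name": "log-shipper"},
    ]}}},
}

if __name__ == "__main__":
    for name, fields in missing_resource_fields(SAMPLE_DEPLOYMENT).items():
        print(name, "is missing:", ", ".join(fields))
```

&lt;p&gt;Run against every workload in a namespace, an empty report means the rule holds there.&lt;/p&gt;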

&lt;p&gt;How I sanity-check it:&lt;br&gt;
▸ Kubecost or OpenCost — per-namespace/team attribution. Without this, no one feels accountable for the number.&lt;br&gt;
▸ KEDA for event-driven workload scaling — eliminates idle replicas without sacrificing responsiveness.&lt;/p&gt;

&lt;p&gt;The best platform teams I've seen measure success by how rarely product teams have to think about infrastructure.&lt;/p&gt;

&lt;p&gt;Deep dive: &lt;a href="https://neeraja-portfolio-v1.vercel.app/workflows/kubernetes-cost-spikes-the-usual-suspects-and-how-to-find-them-fast" rel="noopener noreferrer"&gt;https://neeraja-portfolio-v1.vercel.app/workflows/kubernetes-cost-spikes-the-usual-suspects-and-how-to-find-them-fast&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Strong opinions on this? Good. I want to hear the pushback.&lt;/p&gt;

&lt;p&gt;#kubernetes #finops #devops #platformengineering&lt;/p&gt;

</description>
      <category>devops</category>
      <category>sre</category>
      <category>kubernetes</category>
      <category>terraform</category>
    </item>
  </channel>
</rss>
