<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mrinal Narang</title>
    <description>The latest articles on DEV Community by Mrinal Narang (@mrinal_narang_13a3d00eb37).</description>
    <link>https://dev.to/mrinal_narang_13a3d00eb37</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3981491%2F6c07d98f-ca97-4312-bc7b-40a28c3e7d8a.jpg</url>
      <title>DEV Community: Mrinal Narang</title>
      <link>https://dev.to/mrinal_narang_13a3d00eb37</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mrinal_narang_13a3d00eb37"/>
    <language>en</language>
    <item>
      <title>Two Kubernetes Decisions Nobody Writes About Honestly</title>
      <dc:creator>Mrinal Narang</dc:creator>
      <pubDate>Tue, 30 Jun 2026 03:14:00 +0000</pubDate>
      <link>https://dev.to/mrinal_narang_13a3d00eb37/two-kubernetes-decisions-nobody-writes-about-honestly-29fb</link>
      <guid>https://dev.to/mrinal_narang_13a3d00eb37/two-kubernetes-decisions-nobody-writes-about-honestly-29fb</guid>
      <description>&lt;h2&gt;
  
  
  1. Node Group Sizing
&lt;/h2&gt;

&lt;p&gt;Fewer large nodes vs. many small ones. Textbooks don't cover this.&lt;/p&gt;

&lt;p&gt;We ran 10 nodes, 32 CPU each. Seemed efficient.&lt;/p&gt;

&lt;p&gt;Problem: One node dies, 32 CPU worth of workloads (10% of the cluster) need to reschedule. Cluster autoscaler couldn't handle it. Pods sat pending for 10 minutes.&lt;/p&gt;

&lt;p&gt;We switched to 20 nodes, 16 CPU each. Same total capacity.&lt;/p&gt;

&lt;p&gt;One node dies now? 16 CPU to reschedule (5% of the cluster). Autoscaler catches it in 90 seconds. Scheduling is tighter, but failures are isolated.&lt;/p&gt;

&lt;p&gt;Cost stayed the same. Blast radius halved.&lt;/p&gt;

&lt;p&gt;Why nobody writes about this: The tradeoff isn't obvious. Large nodes are "more efficient." Smaller nodes are "more resilient." Both are true. It depends on whether you'd rather have one big problem or many small ones.&lt;/p&gt;

&lt;p&gt;We picked smaller nodes because a node failure was our actual failure mode. Not resource efficiency.&lt;/p&gt;
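
&lt;p&gt;A minimal sketch of the smaller-node layout, assuming the node group is managed with eksctl. The cluster name, region, node group name, size bounds, and the m5.4xlarge instance type are illustrative assumptions, not the original setup:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical eksctl config: 20 nodes x 16 vCPU instead of 10 x 32.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: example-cluster        # assumed name
  region: us-east-1            # assumed region
managedNodeGroups:
  - name: workers-16cpu        # assumed name
    instanceType: m5.4xlarge   # 16 vCPU per node
    desiredCapacity: 20
    minSize: 20
    maxSize: 30                # assumed headroom for the autoscaler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;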

&lt;h2&gt;
  
  
  2. Readiness vs Liveness Probes
&lt;/h2&gt;

&lt;p&gt;Misconfigure these and your cluster looks like it's melting.&lt;/p&gt;

&lt;p&gt;Readiness probe: "Can this pod take traffic?"&lt;/p&gt;

&lt;p&gt;Liveness probe: "Is this pod alive? Restart it if not."&lt;/p&gt;

&lt;p&gt;One team set readiness = liveness. Same probe checked both.&lt;/p&gt;

&lt;p&gt;Probe logic: "If I can reach the database, I'm ready."&lt;/p&gt;

&lt;p&gt;Database gets slow. Probe fails. Pod becomes "not ready." Load balancer removes it from rotation (correct).&lt;/p&gt;

&lt;p&gt;But liveness also failed. Kubernetes killed the pod and restarted it.&lt;/p&gt;

&lt;p&gt;New pod starts. Probe fails immediately (database still slow). Gets killed. Restarted.&lt;/p&gt;

&lt;p&gt;This cascaded across 30 pods. 30 restarts/minute. New pods spent 100% of time restarting.&lt;/p&gt;

&lt;p&gt;Looks like an application bug for the first 20 minutes. Actually a probe configuration bug.&lt;/p&gt;

&lt;p&gt;Fix: Readiness checks "can I take traffic right now?" Liveness checks "am I fundamentally broken?" Use separate probes.&lt;/p&gt;

&lt;p&gt;Readiness: Check database connection with short timeout. Fail if slow. This is reasonable - don't send traffic to slow pods.&lt;/p&gt;

&lt;p&gt;Liveness: Check if the process is responding at all. Much more tolerant threshold: several consecutive failures before a restart. Only kill if truly hung.&lt;/p&gt;

&lt;p&gt;Same database slowness? Pods become unready. Traffic reroutes. Cluster stabilizes. No restart loop.&lt;/p&gt;
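
&lt;p&gt;A minimal sketch of the split, assuming the app exposes two HTTP endpoints. The paths /readyz and /livez, the port, and the timing values are hypothetical. Readiness includes the database check with a short timeout; liveness only confirms the process answers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Container spec fragment (goes under spec.containers[] in a Deployment).
readinessProbe:
  httpGet:
    path: /readyz          # assumed endpoint: includes the database check
    port: 8080             # assumed port
  periodSeconds: 5
  timeoutSeconds: 2        # short timeout: a slow database means "not ready"
  failureThreshold: 2
livenessProbe:
  httpGet:
    path: /livez           # assumed endpoint: no dependency checks
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6      # restart only after about a minute of no response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;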

&lt;h2&gt;
  
  
  The Connection
&lt;/h2&gt;

&lt;p&gt;Both decisions have hidden failure modes that surface under stress.&lt;/p&gt;

&lt;p&gt;Node sizing looks good until a node fails and you realize your blast radius is too large.&lt;/p&gt;

&lt;p&gt;Probe configuration looks good until the database gets slow and your whole cluster starts thrashing.&lt;/p&gt;

&lt;p&gt;Neither is "wrong." Both have tradeoffs. The difference is understanding the failure mode you're actually optimizing for.&lt;/p&gt;

&lt;p&gt;Large nodes are efficient until they aren't.&lt;/p&gt;

&lt;p&gt;Readiness = Liveness is simple until cascading restarts make it obvious.&lt;/p&gt;

&lt;p&gt;If you're running EKS with 8+ large nodes, consider switching to more, smaller nodes. If you've never hit a restart loop from probe misconfiguration, you will eventually.&lt;/p&gt;

&lt;p&gt;When you do, check if readiness and liveness are the same. Usually they are.&lt;/p&gt;




&lt;p&gt;#Kubernetes #EKS #DevOps #SRE #IncidentResponse&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Blameless Postmortems in Practice</title>
      <dc:creator>Mrinal Narang</dc:creator>
      <pubDate>Mon, 29 Jun 2026 03:07:00 +0000</pubDate>
      <link>https://dev.to/mrinal_narang_13a3d00eb37/blameless-postmortems-in-practice-3ie5</link>
      <guid>https://dev.to/mrinal_narang_13a3d00eb37/blameless-postmortems-in-practice-3ie5</guid>
      <description>&lt;p&gt;Most teams claim they do blameless postmortems.&lt;/p&gt;

&lt;p&gt;Then the incident happens.&lt;/p&gt;

&lt;p&gt;"Jane didn't validate the input."&lt;/p&gt;

&lt;p&gt;"The on-call missed the alert."&lt;/p&gt;

&lt;p&gt;"We should have caught this in code review."&lt;/p&gt;

&lt;p&gt;That's blame. It's just dressed up in process language.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Gap
&lt;/h2&gt;

&lt;p&gt;Blameless postmortems aren't about ignoring human error. They're about understanding why a reasonable person made a decision that, in hindsight, was wrong.&lt;/p&gt;

&lt;p&gt;The question isn't: "What did Jane do wrong?"&lt;/p&gt;

&lt;p&gt;It's: "What made Jane's action seem reasonable at the time?"&lt;/p&gt;

&lt;p&gt;If you can't answer the second question, your postmortem isn't blameless. It's just performative.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Happens
&lt;/h2&gt;

&lt;p&gt;Blameless postmortem (real):&lt;/p&gt;

&lt;p&gt;"The deployment happened without running tests. Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The test environment was down for maintenance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Nobody documented which environment Jane should use instead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It was 11 PM on a Friday.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jane has deployed 200 times without incident.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The process allowed skipping tests if 'urgent.'&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we added automated test gates that can't be bypassed. We documented the backup environment. We made urgent deployments require two people."&lt;/p&gt;

&lt;p&gt;Blame-driven postmortem (disguised):&lt;/p&gt;

&lt;p&gt;"The deployment happened without running tests.&lt;/p&gt;

&lt;p&gt;Root cause: Insufficient process discipline.&lt;/p&gt;

&lt;p&gt;Action item: Remind team to follow procedures."&lt;/p&gt;

&lt;p&gt;One actually changes behavior. One just documents that someone messed up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Test
&lt;/h2&gt;

&lt;p&gt;Read your last three postmortems.&lt;/p&gt;

&lt;p&gt;Count how many times you see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;"Person X should have..."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;"We should have caught..."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;"Insufficient discipline..."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;"Better communication would have..."&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the focus is on what people should do differently, you're not doing blameless postmortems. You're doing blame with better language.&lt;/p&gt;

&lt;p&gt;Real blameless postmortems focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What system allowed this to happen?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What information was missing?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What would have made the better decision obvious?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What tool could have caught this?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Shift That Matters
&lt;/h2&gt;

&lt;p&gt;Blame mindset: "How do we stop people from doing this?"&lt;/p&gt;

&lt;p&gt;Blameless mindset: "How do we build systems where the wrong decision is harder than the right one?"&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;p&gt;Blame: "The engineer deployed without approval."&lt;/p&gt;

&lt;p&gt;Action: "Require manual approvals before deployment."&lt;/p&gt;

&lt;p&gt;Result: Engineers find workarounds. Deployments slow. Nothing changes.&lt;/p&gt;

&lt;p&gt;Blameless: "The engineer deployed without approval. Why did that seem reasonable?"&lt;/p&gt;

&lt;p&gt;Answer: "The approval process was taking 2 hours, and the customer issue was urgent. The engineer bypassed it."&lt;/p&gt;

&lt;p&gt;Action: "Implement auto-approval for critical hotfixes if all tests pass."&lt;/p&gt;

&lt;p&gt;Result: Urgent deployments don't require workarounds. Actual behavior changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Questions That Reveal Blame
&lt;/h2&gt;

&lt;p&gt;"Why did the on-call miss the alert?"&lt;/p&gt;

&lt;p&gt;vs.&lt;/p&gt;

&lt;p&gt;"Why didn't the on-call see the alert? Was the alert buried in noise? Was the alert configured wrong? Was the on-call context-switching too much?"&lt;/p&gt;

&lt;p&gt;First question assumes blame. Second question discovers systems.&lt;/p&gt;

&lt;p&gt;"The engineer didn't validate input."&lt;/p&gt;

&lt;p&gt;vs.&lt;/p&gt;

&lt;p&gt;"Why wasn't input validation enforced at the framework level? Why didn't the linter catch this? Why was this pattern possible?"&lt;/p&gt;

&lt;p&gt;First question is about the engineer. Second question is about the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;Document the decision-making context. Not judgment.&lt;/p&gt;

&lt;p&gt;"The engineer believed the data was validated upstream" is context.&lt;/p&gt;

&lt;p&gt;"The engineer was careless" is judgment.&lt;/p&gt;

&lt;p&gt;Ask: "If this exact situation happened tomorrow, would the same decision seem reasonable to a competent person?"&lt;/p&gt;

&lt;p&gt;If yes, it's a system problem. Fix the system.&lt;/p&gt;

&lt;p&gt;If no, you've found something else.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Part
&lt;/h2&gt;

&lt;p&gt;Real blameless postmortems are harder than blame-driven ones.&lt;/p&gt;

&lt;p&gt;It's easier to say "Person did bad thing" than to trace the systems that made the bad thing seem reasonable.&lt;/p&gt;

&lt;p&gt;It requires admitting that your process enabled the failure.&lt;/p&gt;

&lt;p&gt;It requires changing things instead of just documenting them.&lt;/p&gt;

&lt;p&gt;But it's the only approach that actually changes behavior.&lt;/p&gt;

&lt;p&gt;Teams that claim "blameless" but still use postmortems as accountability theater don't fix anything. They just have better documentation of blame.&lt;/p&gt;

&lt;p&gt;Teams that actually ask "why would a reasonable person make this decision?" build systems where the failures stop happening.&lt;/p&gt;

&lt;p&gt;Check your last postmortem. What were the action items?&lt;/p&gt;

&lt;p&gt;If they're mostly about "team discipline" or "better communication," you're doing blame with better language.&lt;/p&gt;

&lt;p&gt;If they're about systems, tools, and removing friction from the right path, you're actually being blameless.&lt;/p&gt;




&lt;p&gt;#DevOps #IncidentResponse #Postmortem #Blameless #TeamCulture #SRE&lt;/p&gt;

</description>
      <category>devops</category>
      <category>management</category>
      <category>sre</category>
    </item>
    <item>
      <title>Scaling Cooldown Tuning: Stop Your Autoscaler From Thrashing</title>
      <dc:creator>Mrinal Narang</dc:creator>
      <pubDate>Mon, 29 Jun 2026 02:57:00 +0000</pubDate>
      <link>https://dev.to/mrinal_narang_13a3d00eb37/scaling-cooldown-tuning-stop-your-autoscaler-from-thrashing-24bg</link>
      <guid>https://dev.to/mrinal_narang_13a3d00eb37/scaling-cooldown-tuning-stop-your-autoscaler-from-thrashing-24bg</guid>
      <description>&lt;p&gt;Your HPA is flapping.&lt;/p&gt;

&lt;p&gt;Pods spin up. Traffic dips. Pods spin down. Traffic returns. Pods spin up again. All within 90 seconds.&lt;/p&gt;

&lt;p&gt;This costs money and stability. Every scale event creates pod churn. New pods need to warm up. Connections restart. Metrics refresh.&lt;/p&gt;

&lt;p&gt;The fix isn't complicated. It's tuning cooldown periods.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Flapping Looks Like
&lt;/h2&gt;

&lt;p&gt;Before tuning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;9:15 AM: CPU hits 75%. HPA scales 3→5 pods.&lt;/li&gt;
&lt;li&gt;9:16 AM: Traffic normalizes. CPU drops to 60%.&lt;/li&gt;
&lt;li&gt;9:17 AM: HPA scales 5→3 pods (the default scaleDown stabilization window is 300s, but we weren't respecting it).&lt;/li&gt;
&lt;li&gt;9:18 AM: Batch request comes in. CPU jumps to 80%.&lt;/li&gt;
&lt;li&gt;9:19 AM: HPA scales 3→5 pods again.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every 60-90 seconds. Constantly. Pod logs show connection resets every minute.&lt;/p&gt;

&lt;p&gt;Billing spike? $200/day in unnecessary compute because pods kept restarting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tuning
&lt;/h2&gt;

&lt;p&gt;We set the scaling behavior explicitly:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;behavior&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scaleDown&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;300&lt;/span&gt;    &lt;span class="c1"&gt;# wait 5 min before scaling down&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Percent&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
  &lt;span class="na"&gt;scaleUp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;stabilizationWindowSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;      &lt;span class="c1"&gt;# scale up immediately&lt;/span&gt;
    &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Percent&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why these values?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;scaleDown stabilizationWindow: 300s (5 min).&lt;/strong&gt; If CPU drops below threshold, wait 5 minutes before actually scaling down. The HPA acts on the highest replica recommendation from the last 5 minutes, so a temporary dip doesn't remove pods. Most traffic spikes last longer than 90 seconds. One team tried 60s, still flapping. 300s worked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;scaleDown percent: 50.&lt;/strong&gt; Remove at most half the pods per 60-second period, not all of them at once. Even going from 5 pods to 3 is a bet that you don't need those 2. Capping each step at 50% (5→3) is safer than allowing 100% in one step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;scaleUp stabilizationWindow: 0.&lt;/strong&gt; When CPU hits 75%, scale immediately. You have customers waiting. Slow scale-up means slow response time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;scaleUp percent: 100.&lt;/strong&gt; Double the pod count if needed. If you're at 3 pods and hitting limits, jump to 6. Better to overprovision briefly than make customers wait.&lt;/p&gt;
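
&lt;p&gt;For context, a sketch of where the behavior block sits in a full autoscaling/v2 HorizontalPodAutoscaler. The resource names, the Deployment target, minReplicas, and maxReplicas are assumptions; the 75% CPU target comes from the scenario above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                  # assumed Deployment
  minReplicas: 3               # assumed from the 3-pod baseline
  maxReplicas: 12              # assumed ceiling
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;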

&lt;h2&gt;
  
  
  After Tuning
&lt;/h2&gt;

&lt;p&gt;Same 9:15 AM scenario:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;9:15 AM: CPU hits 75%. HPA scales 3→5 pods immediately.&lt;/li&gt;
&lt;li&gt;9:20 AM: Traffic stabilizes. System waits (stabilization window).&lt;/li&gt;
&lt;li&gt;9:25 AM: CPU still below 60%. HPA scales 5→3 pods.&lt;/li&gt;
&lt;li&gt;9:26 AM: No more thrashing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pod restart rate dropped 95%. Load balancer connection resets went from 60/min to 2/min.&lt;/p&gt;

&lt;p&gt;Monthly compute cost dropped $1,400 (was $8,500/month due to churn, now $7,100).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Principle
&lt;/h2&gt;

&lt;p&gt;Scale up fast, scale down slow.&lt;/p&gt;

&lt;p&gt;Customers need capacity now. They don't care if you have extra pods for 5 minutes. They do care if you're constantly churning them.&lt;/p&gt;

&lt;p&gt;Stabilization windows let temporary spikes and dips pass without action. Percent-based scaling lets you adjust gradually instead of binary yes/no decisions.&lt;/p&gt;

&lt;p&gt;One team still uses defaults. They have pod churn every 90 seconds. Another adjusted to these values and saw pod churn once per day, only when actual traffic patterns genuinely changed.&lt;/p&gt;

&lt;p&gt;Your HPA is probably thrashing. Check your stabilization windows. If you see pod restart spikes that correlate with CPU threshold crossings, you've found it.&lt;/p&gt;

&lt;p&gt;Set the scaleDown stabilization window to 300s. Set the scaleUp stabilization window to 0. Adjust percents based on your app. Test. Most teams see 70-80% reduction in unnecessary scaling events.&lt;/p&gt;




&lt;p&gt;#Kubernetes #Autoscaling #HPA #CostOptimization #DevOps&lt;/p&gt;

</description>
      <category>devops</category>
      <category>infrastructure</category>
      <category>kubernetes</category>
      <category>sre</category>
    </item>
    <item>
      <title>Dependency Mapping and Hidden Failure Modes</title>
      <dc:creator>Mrinal Narang</dc:creator>
      <pubDate>Sun, 28 Jun 2026 02:57:00 +0000</pubDate>
      <link>https://dev.to/mrinal_narang_13a3d00eb37/dependency-mapping-and-hidden-failure-modes-5500</link>
      <guid>https://dev.to/mrinal_narang_13a3d00eb37/dependency-mapping-and-hidden-failure-modes-5500</guid>
      <description>&lt;p&gt;You've got your architecture diagram.&lt;/p&gt;

&lt;p&gt;It looks good. Services connected with clear lines. Data flows. Integration points.&lt;/p&gt;

&lt;p&gt;Solid design.&lt;/p&gt;

&lt;p&gt;Then production goes down.&lt;/p&gt;

&lt;p&gt;And the outage spreads through a dependency nobody drew on that diagram.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Reality
&lt;/h2&gt;

&lt;p&gt;Most outages don't follow the architecture diagram. They follow the actual code.&lt;/p&gt;

&lt;p&gt;You have a service that calls Service A. Service A calls Service B synchronously. Service B reads from a cache. That cache is backed by Service C. Service C has an undocumented polling relationship with Service D.&lt;/p&gt;

&lt;p&gt;Nowhere on your diagram.&lt;/p&gt;

&lt;p&gt;But when D fails? The entire stack goes down. In order: C gets slow, B times out, A gets backed up, your service drowns in connection timeouts.&lt;/p&gt;

&lt;p&gt;Customers notice before your alerts fire.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Gets Missed
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Implicit dependencies.&lt;/strong&gt; Service A doesn't explicitly call Service B. But A reads from a table that B populates. If B stops writing, A fails silently. Nobody knew they were coupled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transitive failures.&lt;/strong&gt; You know you depend on the database. What you don't know is that the database client library maintains a background connection pool that hits an internal service. That service goes down. Your database works fine. Your application hangs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Async failures hidden as success.&lt;/strong&gt; A request succeeds, returns 200. But a background job that's supposed to process the data never fires. The dependency broke, you didn't notice for hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shared infrastructure you forgot about.&lt;/strong&gt; Two services running on the same Kubernetes node. One burns CPU, the other starves. You didn't plan for them to interfere. They do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third-party API cascades.&lt;/strong&gt; Your service integrates with an API that calls another API internally. When that internal API is slow, your service times out. You didn't know about the dependency. The API provider didn't document it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How You Actually Discover Dependencies
&lt;/h2&gt;

&lt;p&gt;You don't discover them during planning sessions. You discover them during incidents.&lt;/p&gt;

&lt;p&gt;2 AM. Everything is burning. You start tracing requests. You find a call you didn't know existed. You look at the code. "Oh. Yeah. Service X calls Service Y as a fire-and-forget."&lt;/p&gt;

&lt;p&gt;You knew about Service X. You knew about Service Y. You didn't know they were connected.&lt;/p&gt;

&lt;p&gt;By the time you're discovering this, customers have been down for 40 minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tools Help But Don't Solve It
&lt;/h2&gt;

&lt;p&gt;Network traffic analysis shows connections. Distributed tracing reveals call chains. APM tools map service interactions.&lt;/p&gt;

&lt;p&gt;These help. But they only show you what's currently happening. If a dependency is dormant, it's invisible. If a failure path is rare, you won't see it until it happens.&lt;/p&gt;

&lt;p&gt;A service that calls another service only during payment processing won't show up in your dependency map until someone tries to make a payment during an outage.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Run incidents. Deliberately.&lt;/strong&gt; Gamedays and chaos engineering aren't about proving resilience. They're about discovering unknown dependencies before they become production incidents. Shut down a service you think is non-critical and watch what breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace the data, not the diagram.&lt;/strong&gt; Follow what happens to a customer request. Where does it go? What systems read the results? What systems depend on side effects? Write it down. That's your actual architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check what you're not monitoring.&lt;/strong&gt; If you're not alerting on a dependency, you probably don't know about it. Set a timer. Pick a random service. Ask: what would break if this disappeared right now? If you don't know, you've found a hidden dependency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document after incidents.&lt;/strong&gt; The postmortem is the best time to update your architecture diagram. You now know something that wasn't documented before. Write it down so the next person doesn't learn it during an outage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assume cache failures.&lt;/strong&gt; Every cache hit is a hidden dependency. Every background job is a failure mode. Every async operation is a silent failure waiting to happen. Don't assume these are optional.&lt;/p&gt;
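
&lt;p&gt;One way to run the "shut it down and watch what breaks" exercise, sketched as a Chaos Mesh PodChaos experiment. This assumes Chaos Mesh is installed in the cluster; the experiment name, target namespace, label, and duration are hypothetical, and the same drill can be run by hand in a staging environment:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: gameday-reporting-outage   # assumed name
  namespace: chaos-mesh
spec:
  action: pod-failure              # make the selected pods unavailable
  mode: all                        # every pod matching the selector
  duration: "10m"                  # experiment ends on its own
  selector:
    namespaces:
      - staging                    # assumed namespace
    labelSelectors:
      app: reporting-service       # assumed "non-critical" service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;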

&lt;h2&gt;
  
  
  The Honest Answer
&lt;/h2&gt;

&lt;p&gt;You can't map every dependency. Some are emergent properties of how systems interact. Some only become relevant during specific failure scenarios.&lt;/p&gt;

&lt;p&gt;But you can discover them faster.&lt;/p&gt;

&lt;p&gt;Run incidents before production does. Trace requests end-to-end. Alert on the things you're not expecting to fail. When something breaks, update your diagram.&lt;/p&gt;

&lt;p&gt;Most outages spread through things you didn't know existed. The goal isn't to prevent that.&lt;/p&gt;

&lt;p&gt;It's to find out what you don't know before the customers do.&lt;/p&gt;




&lt;p&gt;#DevOps #SRE #Architecture #IncidentResponse #Systems #Observability&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>microservices</category>
      <category>sre</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Kubernetes Cost Optimization: Stop Buying Compute You Never Needed</title>
      <dc:creator>Mrinal Narang</dc:creator>
      <pubDate>Thu, 25 Jun 2026 11:08:00 +0000</pubDate>
      <link>https://dev.to/mrinal_narang_13a3d00eb37/kubernetes-cost-optimization-stop-buying-compute-you-never-needed-5cf6</link>
      <guid>https://dev.to/mrinal_narang_13a3d00eb37/kubernetes-cost-optimization-stop-buying-compute-you-never-needed-5cf6</guid>
      <description>&lt;p&gt;Ask teams how they're reducing Kubernetes costs and you hear: Spot Instances, autoscaling, Reserved Instances, Graviton.&lt;/p&gt;

&lt;p&gt;All worthwhile.&lt;/p&gt;

&lt;p&gt;But here's what I've found actually works:&lt;/p&gt;

&lt;p&gt;Stop paying for resources workloads never use.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Problem
&lt;/h2&gt;

&lt;p&gt;Most clusters are carrying years of operational assumptions.&lt;/p&gt;

&lt;p&gt;"Add some buffer." "Double the memory just in case." "Optimize later."&lt;/p&gt;

&lt;p&gt;Months later, those assumptions become production reality.&lt;/p&gt;

&lt;p&gt;And production reality becomes a monthly invoice.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Happens
&lt;/h2&gt;

&lt;p&gt;Compare what pods request versus what they use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Services requesting 2 CPU consuming 200m&lt;/li&gt;
&lt;li&gt;Apps requesting 4 GB RAM consuming 800 MB&lt;/li&gt;
&lt;li&gt;Workloads requesting 8 GB using less than 1.5 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kubernetes reserves node capacity for resources that are never used. Extra nodes get provisioned. Not because applications need them. Because requests claim they do.&lt;/p&gt;

&lt;p&gt;One team reduced their monthly bill by $40,000 just by bringing pod requests in line with actual usage.&lt;/p&gt;

&lt;p&gt;No new technology. No architecture changes. Just honesty.&lt;/p&gt;
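
&lt;p&gt;A sketch of what right-sizing looks like for the first two cases above. The container name is hypothetical, and the new values (observed usage plus some headroom) are illustrative, not a recommendation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Container spec fragment.
containers:
  - name: api                  # assumed name
    resources:
      requests:
        cpu: 300m              # was 2 CPU; observed usage around 200m
        memory: 1Gi            # was 4Gi; observed usage around 800 MB
      limits:
        memory: 2Gi            # illustrative cap that leaves room for spikes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;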

&lt;h2&gt;
  
  
  The Other Hidden Savings
&lt;/h2&gt;

&lt;p&gt;Every cluster has workloads nobody needs. Pods that haven't processed meaningful traffic in months. Legacy integrations kept alive "just in case."&lt;/p&gt;

&lt;p&gt;Reporting jobs that run 24/7 but only need 8 hours. Data processing jobs running overnight despite having no users.&lt;/p&gt;

&lt;p&gt;Scheduling workloads to match actual demand? Teams save 30-40% on cluster costs.&lt;/p&gt;
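
&lt;p&gt;For the reporting-job case, a minimal sketch of replacing an always-on workload with a batch/v1 CronJob. The name, image, schedule, and resource values are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-report                 # assumed name
spec:
  schedule: "0 6 * * 1-5"            # 06:00 on weekdays; pick your own window
  concurrencyPolicy: Forbid          # skip a run if the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: report
              image: registry.example.com/reporting:latest   # assumed image
              resources:
                requests:
                  cpu: 500m
                  memory: 1Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;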

&lt;h2&gt;
  
  
  What Doesn't Help
&lt;/h2&gt;

&lt;p&gt;Autoscaling doesn't fix bad sizing. If pods start oversized, autoscaling just scales the oversized pods. Costs scale with the oversized assumptions.&lt;/p&gt;

&lt;p&gt;Resource requests and limits set during panic rarely get revisited. The emergency passes. The oversized values stay. Years later, you're still paying for a worst-case scenario.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Works
&lt;/h2&gt;

&lt;p&gt;Use Grafana, Prometheus, Metrics Server, Kubecost.&lt;/p&gt;

&lt;p&gt;Compare requests versus usage. Check pod activity patterns. Review scaling behavior. Look at which services consume capacity but deliver little value.&lt;/p&gt;

&lt;p&gt;The data usually tells a very different story than assumptions.&lt;/p&gt;
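
&lt;p&gt;A sketch of the requests-versus-usage comparison as Prometheus recording rules. It assumes kube-state-metrics and cAdvisor metrics are already being scraped; the group and rule names are made up:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;groups:
  - name: rightsizing                      # assumed group name
    rules:
      # Fraction of requested CPU actually used, per pod (1.0 = fully used).
      - record: pod:cpu_usage_to_request:ratio
        expr: |
          sum by (namespace, pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
          /
          sum by (namespace, pod) (kube_pod_container_resource_requests{resource="cpu"})
      # Fraction of requested memory actually in use, per pod.
      - record: pod:memory_usage_to_request:ratio
        expr: |
          sum by (namespace, pod) (container_memory_working_set_bytes{container!=""})
          /
          sum by (namespace, pod) (kube_pod_container_resource_requests{resource="memory"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pods whose ratio stays well below 1 for weeks are the candidates for smaller requests.&lt;/p&gt;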

&lt;h2&gt;
  
  
  The Simple Question
&lt;/h2&gt;

&lt;p&gt;If every workload had to justify its resource requests today, how many would survive unchanged?&lt;/p&gt;

&lt;p&gt;That's usually where the real savings begin.&lt;/p&gt;




&lt;p&gt;#Kubernetes #CostOptimization #DevOps #CloudEngineering #AWS&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>beginners</category>
      <category>career</category>
    </item>
    <item>
      <title>Dashboard Design for Incident Response</title>
      <dc:creator>Mrinal Narang</dc:creator>
      <pubDate>Wed, 24 Jun 2026 11:03:00 +0000</pubDate>
      <link>https://dev.to/mrinal_narang_13a3d00eb37/dashboard-design-for-incident-response-24i7</link>
      <guid>https://dev.to/mrinal_narang_13a3d00eb37/dashboard-design-for-incident-response-24i7</guid>
      <description>&lt;p&gt;Most dashboards answer one question: &lt;em&gt;Is everything okay?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;During an incident, nobody's asking that.&lt;/p&gt;

&lt;p&gt;The real question: &lt;em&gt;What broke, where, and what changed?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most dashboards fail at incidents because they were built for monitoring, not troubleshooting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;A typical dashboard shows CPU, memory, disk, network, requests, uptime.&lt;/p&gt;

&lt;p&gt;Useful for routine checks.&lt;/p&gt;

&lt;p&gt;During an outage? Just noise.&lt;/p&gt;

&lt;p&gt;You're not looking for reassurance. You're looking for evidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Different Jobs
&lt;/h2&gt;

&lt;p&gt;Most teams put everything on one dashboard. That's a compromise that doesn't work for either job.&lt;/p&gt;

&lt;p&gt;Monitoring dashboard: Is the platform healthy? SLAs being met? Resources used correctly?&lt;/p&gt;

&lt;p&gt;Incident dashboard: What failed? When? What changed? Where do I look next?&lt;/p&gt;

&lt;p&gt;Same tools, different purposes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Works During an Outage
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Error rate front and center.&lt;/strong&gt; 5XX errors, exceptions, failed transactions. Failures tell the story faster than CPU metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Timeline on the graph.&lt;/strong&gt; Mark deployments, infrastructure changes, scaling events. Most incidents start right after something changed. Make this visible in one second.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dependency health.&lt;/strong&gt; A healthy app talking to a dead database is not healthy. Dependencies often point to root cause faster than app metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Golden signals.&lt;/strong&gt; Latency, traffic, errors, saturation. These beat hundreds of infrastructure metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logs visible.&lt;/strong&gt; Top exceptions, error spikes, failed endpoints. Reduce tab-switching during incidents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Service map.&lt;/strong&gt; Which services depend on the failing one? Visual dependency maps answer this instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert state.&lt;/strong&gt; Which alerts fired? Which started first? First alert usually beats alert #100 for root cause.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Test
&lt;/h2&gt;

&lt;p&gt;For every panel: How does this help me resolve the incident faster?&lt;/p&gt;

&lt;p&gt;If the answer isn't obvious, remove it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;p&gt;EKS outage. Don't show cluster CPU and memory.&lt;/p&gt;

&lt;p&gt;Show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failed requests by service&lt;/li&gt;
&lt;li&gt;Pod restarts&lt;/li&gt;
&lt;li&gt;Readiness failures&lt;/li&gt;
&lt;li&gt;Recent deployments&lt;/li&gt;
&lt;li&gt;HPA scaling events&lt;/li&gt;
&lt;li&gt;Dependency latency&lt;/li&gt;
&lt;li&gt;Top exceptions&lt;/li&gt;
&lt;li&gt;Queue backlogs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One tells you the cluster exists. The other helps you fix it.&lt;/p&gt;
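
&lt;p&gt;A few of these panels, sketched as Prometheus recording rules a dashboard could query. The kube_* metrics assume kube-state-metrics is installed; http_requests_total with service and status labels is a hypothetical application metric, and the rule names are made up:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;groups:
  - name: incident-dashboard               # assumed group name
    rules:
      # Failed requests by service (hypothetical app metric and labels).
      - record: service:http_5xx:rate5m
        expr: sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
      # Pod restarts over the last 15 minutes.
      - record: pod:container_restarts:increase15m
        expr: sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[15m]))
      # Pods currently failing readiness.
      - record: namespace:pods_not_ready:count
        expr: sum by (namespace) (kube_pod_status_ready{condition="false"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;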

&lt;h2&gt;
  
  
  The Point
&lt;/h2&gt;

&lt;p&gt;Monitoring dashboards tell you something broke.&lt;/p&gt;

&lt;p&gt;Incident dashboards help you figure out why.&lt;/p&gt;

&lt;p&gt;During an outage, only the second one matters.&lt;/p&gt;




&lt;p&gt;#DevOps #SRE #Monitoring #Kubernetes #IncidentResponse #Dashboard&lt;/p&gt;

</description>
      <category>programming</category>
      <category>tutorial</category>
      <category>devops</category>
      <category>career</category>
    </item>
    <item>
      <title>Blackbox Monitoring vs Internal Metrics - The Gap Between "Healthy" and "Working"</title>
      <dc:creator>Mrinal Narang</dc:creator>
      <pubDate>Sun, 21 Jun 2026 11:00:00 +0000</pubDate>
      <link>https://dev.to/mrinal_narang_13a3d00eb37/blackbox-monitoring-vs-internal-metrics-the-gap-between-healthy-and-working-15ed</link>
      <guid>https://dev.to/mrinal_narang_13a3d00eb37/blackbox-monitoring-vs-internal-metrics-the-gap-between-healthy-and-working-15ed</guid>
      <description>&lt;p&gt;You've probably had this incident. Dashboards are all green. CPU is fine. Memory looks good. Pods aren't restarting. Databases are healthy. But customers can't log in, or payments won't process, or nothing's loading.&lt;/p&gt;

&lt;p&gt;You check Prometheus. Nothing's firing. Everything says "we're fine."&lt;/p&gt;

&lt;p&gt;Except you're not fine.&lt;/p&gt;

&lt;p&gt;A healthy system is not the same as a working system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Blind Spot
&lt;/h2&gt;

&lt;p&gt;Most monitoring setups measure what's happening inside the infrastructure.&lt;/p&gt;

&lt;p&gt;CPU utilization. Memory consumption. Disk usage. Network throughput. Pod restarts. Request rates. Error counts.&lt;/p&gt;

&lt;p&gt;These metrics matter. But they answer one question: &lt;em&gt;How are our components behaving?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Customers are asking something different: &lt;em&gt;Can I complete my task?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The gap between those two questions is where incidents hide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Internal Metrics Show You The Engine
&lt;/h2&gt;

&lt;p&gt;Think of a car dashboard showing engine temperature normal, fuel level normal, oil pressure normal, battery healthy.&lt;/p&gt;

&lt;p&gt;Everything looks fine.&lt;/p&gt;

&lt;p&gt;But the steering wheel is disconnected.&lt;/p&gt;

&lt;p&gt;That's what a lot of monitoring does. We measure component health while assuming the customer journey works. Usually it does. Sometimes it doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scenarios You Learn From
&lt;/h2&gt;

&lt;p&gt;Most teams adopt synthetic monitoring after a painful incident. The postmortem reads the same way every time:&lt;/p&gt;

&lt;p&gt;"All services were healthy."&lt;/p&gt;

&lt;p&gt;"Kubernetes showed no issues."&lt;/p&gt;

&lt;p&gt;"Database latency was normal."&lt;/p&gt;

&lt;p&gt;"But customers couldn't log in."&lt;/p&gt;

&lt;p&gt;Or:&lt;/p&gt;

&lt;p&gt;"But payments weren't processing."&lt;/p&gt;

&lt;p&gt;Or:&lt;/p&gt;

&lt;p&gt;"But they couldn't upload files."&lt;/p&gt;

&lt;p&gt;The issue wasn't invisible. You just weren't measuring it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Gets Missed
&lt;/h2&gt;

&lt;p&gt;Your API returns HTTP 200. Your authentication service is running. Your database is healthy. But the token validation fails because a certificate expired. Green dashboards. Users stuck.&lt;/p&gt;

&lt;p&gt;Or a downstream dependency fails silently. Metrics show low latency, healthy containers, no restarts. Customers get incomplete results.&lt;/p&gt;

&lt;p&gt;Or a DNS misconfiguration breaks resolution. Everything internal looks normal. Users see downtime.&lt;/p&gt;

&lt;p&gt;Or a JavaScript bug on the frontend breaks the checkout flow. Your backend is fine. Your infrastructure is fine. Users can't complete transactions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Blackbox Monitoring Actually Tests This
&lt;/h2&gt;

&lt;p&gt;Blackbox monitoring doesn't care about implementation details. It behaves like a customer.&lt;/p&gt;

&lt;p&gt;Instead of asking "Is the service running?" it asks "Can the user successfully log in? Make a payment? Upload a file? Finish a transaction?"&lt;/p&gt;

&lt;p&gt;If the infrastructure is healthy but blackbox monitoring fails, you've found your incident.&lt;/p&gt;
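&lt;p&gt;A rough sketch of what "behaves like a customer" can look like in code. It assumes each step of a journey (log in, pay, upload) is wrapped in a function that returns True on success. The step functions, the &lt;code&gt;ProbeResult&lt;/code&gt; type, and &lt;code&gt;run_journey&lt;/code&gt; are all hypothetical names, not part of any monitoring product.&lt;/p&gt;

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    ok: bool
    failed_step: Optional[str]
    duration_s: float


def run_journey(steps):
    """Run (name, action) pairs in order and stop at the first step that fails.

    An action counts as failed if it returns a falsy value or raises.
    """
    started = time.monotonic()
    for name, action in steps:
        try:
            passed = bool(action())
        except Exception:
            passed = False
        if not passed:
            return ProbeResult(False, name, time.monotonic() - started)
    return ProbeResult(True, None, time.monotonic() - started)


# Stand-in steps; real ones would call the public endpoints the way a user does.
result = run_journey([
    ("login", lambda: True),
    ("checkout", lambda: False),
    ("logout", lambda: True),
])
print(result.ok, result.failed_step)
```

&lt;p&gt;The probe reports the first step a user could not complete, which is the part internal metrics tend to miss.&lt;/p&gt;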

&lt;h2&gt;
  
  
  Which Alert Matters More
&lt;/h2&gt;

&lt;p&gt;CPU utilization exceeded 85%.&lt;/p&gt;

&lt;p&gt;vs.&lt;/p&gt;

&lt;p&gt;Customers cannot complete checkout.&lt;/p&gt;

&lt;p&gt;The second one, obviously. Because customers don't buy CPU.&lt;/p&gt;

&lt;p&gt;The whole point of observability isn't to monitor infrastructure. It's to protect business functions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use Both
&lt;/h2&gt;

&lt;p&gt;This isn't a choice. Internal metrics and blackbox monitoring solve different problems.&lt;/p&gt;

&lt;p&gt;Internal metrics help you understand &lt;em&gt;why&lt;/em&gt; something failed. Which component is degraded. Where the bottleneck is. What engineers should investigate.&lt;/p&gt;

&lt;p&gt;Blackbox monitoring tells you &lt;em&gt;whether anyone cares yet&lt;/em&gt;. Are customers impacted? Can critical workflows succeed? Is the platform delivering value?&lt;/p&gt;

&lt;p&gt;One explains the story. The other tells you if the story matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Example
&lt;/h2&gt;

&lt;p&gt;Your streaming platform goes down.&lt;/p&gt;

&lt;p&gt;Internally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes healthy&lt;/li&gt;
&lt;li&gt;RabbitMQ healthy&lt;/li&gt;
&lt;li&gt;CPU normal&lt;/li&gt;
&lt;li&gt;Memory normal&lt;/li&gt;
&lt;li&gt;Databases healthy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Blackbox monitoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Video playback success rate: 0%&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which alert wakes someone up? The playback failure. That's the closest thing to what your users actually experience.&lt;/p&gt;
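&lt;p&gt;One way to turn a success rate like that into a paging decision, sketched under assumptions: probe outcomes arrive as a list of booleans for a recent window, and the 95% threshold and five-sample minimum are placeholder values, not recommendations from this post.&lt;/p&gt;

```python
def success_rate(outcomes):
    """Fraction of successful probe runs in the window (1.0 if the window is empty)."""
    if not outcomes:
        return 1.0
    return sum(1 for ok in outcomes if ok) / len(outcomes)


def should_page(outcomes, min_rate=0.95, min_samples=5):
    """Page only when there are enough samples and the success rate is below min_rate."""
    if min_samples > len(outcomes):
        return False
    return min_rate > success_rate(outcomes)


playback_window = [False] * 10  # every synthetic playback attempt failed
print(success_rate(playback_window), should_page(playback_window))
```

&lt;p&gt;Requiring a minimum number of samples keeps a single flaky probe run from waking someone up.&lt;/p&gt;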

&lt;h2&gt;
  
  
  The Danger
&lt;/h2&gt;

&lt;p&gt;The most damaging outages happen when internal monitoring and customer experience tell different stories.&lt;/p&gt;

&lt;p&gt;If you only measure what's happening inside your platform, you're seeing half the picture. Your pods are healthy. Your databases are fine. Your services are running.&lt;/p&gt;

&lt;p&gt;Your users just can't do anything with them.&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>devops</category>
      <category>productivity</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Alert Fatigue Is an Architecture Problem, Not a Process Problem</title>
      <dc:creator>Mrinal Narang</dc:creator>
      <pubDate>Sat, 20 Jun 2026 10:55:17 +0000</pubDate>
      <link>https://dev.to/mrinal_narang_13a3d00eb37/alert-fatigue-is-an-architecture-problem-not-a-process-problem-3414</link>
      <guid>https://dev.to/mrinal_narang_13a3d00eb37/alert-fatigue-is-an-architecture-problem-not-a-process-problem-3414</guid>
      <description>&lt;p&gt;Every operations team gets the same advice: improve your runbooks, create better escalation policies, train engineers on incident response, tune alert thresholds. Some of it sticks. Most of it doesn't actually fix the problem.&lt;/p&gt;

&lt;p&gt;When 200 alerts fire during a single incident, the real issue isn't that your engineers lack documentation. It's that your architecture allows 200 different things to break independently.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question Most Teams Miss
&lt;/h2&gt;

&lt;p&gt;Organizations usually ask: &lt;em&gt;How can we manage alerts better?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The better question is: &lt;em&gt;Why are there so many alerts in the first place?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Alert fatigue gets treated as an ops problem — adjust PagerDuty, refine notification rules, write more runbooks. But incidents keep generating hundreds of alerts. That's because alerts aren't the problem. They're just the symptom.&lt;/p&gt;

&lt;p&gt;The actual problem is in your system design.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Happens
&lt;/h2&gt;

&lt;p&gt;Take a customer-facing app on Kubernetes. One database latency spike.&lt;/p&gt;

&lt;p&gt;Within minutes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Application pods time out&lt;/li&gt;
&lt;li&gt;CPU climbs as retries pile up&lt;/li&gt;
&lt;li&gt;Message queues back up&lt;/li&gt;
&lt;li&gt;API response times tank&lt;/li&gt;
&lt;li&gt;Load balancer health checks fail&lt;/li&gt;
&lt;li&gt;Autoscaling spins up new pods&lt;/li&gt;
&lt;li&gt;Those pods can't pass readiness checks&lt;/li&gt;
&lt;li&gt;Cache hit rates drop&lt;/li&gt;
&lt;li&gt;Downstream services start failing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One failure. Two hundred alerts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;40 infrastructure alerts&lt;/li&gt;
&lt;li&gt;60 application alerts&lt;/li&gt;
&lt;li&gt;30 database alerts&lt;/li&gt;
&lt;li&gt;20 queue alerts&lt;/li&gt;
&lt;li&gt;50 synthetic monitoring alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Did 200 systems actually fail? No. One thing broke. Your architecture just exposed it 200 different ways.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Better Documentation Won't Help
&lt;/h2&gt;

&lt;p&gt;Runbooks let people respond faster. They don't reduce the number of failure signals. If an incident throws 300 alerts at you, a great runbook just helps you navigate the noise more efficiently. It doesn't eliminate the noise.&lt;/p&gt;

&lt;p&gt;It's like putting better labels on a car's dashboard warning lights while ignoring the fact that a single engine problem triggers 30 different indicators. The labels help. The engine still needs fixing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Matters
&lt;/h2&gt;

&lt;p&gt;Teams with mature reliability practices focus on one thing: reducing how far failures propagate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Isolation works.&lt;/strong&gt; A failing service shouldn't take down everything else. Use circuit breakers, bulkheads, service boundaries, graceful degradation. Make failures stay in their lane.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert hierarchies matter.&lt;/strong&gt; Not every metric should alert. If the database goes down, you alert on that. If the API gets slow &lt;em&gt;because&lt;/em&gt; the database is down, that's a derivative symptom: group it with the root cause alert instead of firing it separately. Give people one actionable alert, not dozens of noisy, related ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause visibility works.&lt;/strong&gt; Your observability setup should answer "what actually broke?" not "here are 150 warnings, good luck." Connect the dots so correlations are obvious.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure blast radius matters.&lt;/strong&gt; Architecture designed to contain failures generates far fewer alerts than architecture that lets one broken thing cascade everywhere.&lt;/p&gt;
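&lt;p&gt;The alert hierarchy idea can be sketched with a dependency map: a firing service is treated as a root cause only if none of its dependencies are also firing, and everything else is grouped under the root it depends on. The service names and the &lt;code&gt;DEPENDS_ON&lt;/code&gt; map are hypothetical, and real alert managers express this differently.&lt;/p&gt;

```python
# Hypothetical dependency map: service -> the services it depends on.
DEPENDS_ON = {
    "api": ["database", "cache"],
    "checkout": ["api"],
    "database": [],
    "cache": [],
}


def upstream_of(service, depends_on):
    """All direct and transitive dependencies of a service."""
    seen = set()
    stack = list(depends_on.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def group_alerts(firing, depends_on):
    """Map each root-cause service to the firing services that depend on it."""
    firing = set(firing)
    failing_upstream = {
        s: upstream_of(s, depends_on).intersection(firing) for s in firing
    }
    # Roots are firing services with no firing dependency of their own.
    groups = {s: [] for s in sorted(firing) if not failing_upstream[s]}
    for service in sorted(firing):
        for root in sorted(failing_upstream[service]):
            if root in groups:
                groups[root].append(service)
    return groups


print(group_alerts(["api", "checkout", "database"], DEPENDS_ON))
# prints {'database': ['api', 'checkout']}
```

&lt;p&gt;Three firing alerts collapse into one actionable root with two attached symptoms. This only works as well as the dependency map is accurate.&lt;/p&gt;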

&lt;h2&gt;
  
  
  What to Actually Measure
&lt;/h2&gt;

&lt;p&gt;Most teams track MTTR, availability, error rates, SLA compliance. Those matter. But they miss the architectural signal:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert-to-incident ratio.&lt;/strong&gt; How many alerts per incident? 1 to 10 is healthy. 11 to 50 is a problem. More than 50 means your architecture is amplifying failure signals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause multiplication factor.&lt;/strong&gt; One broken component shouldn't create 100 alerts. If it does, that number tells you something about your coupling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alert actionability.&lt;/strong&gt; What percentage of your alerts actually need human action? If only 5%, the other 95% is noise.&lt;/p&gt;
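&lt;p&gt;These numbers are cheap to compute once alerts are tagged with an incident and with whether anyone acted on them. A small sketch, assuming each alert is a dict with a hypothetical &lt;code&gt;actioned&lt;/code&gt; flag. The bands mirror the thresholds above, and how exact boundary values are handled is an assumption.&lt;/p&gt;

```python
def alert_to_incident_ratio(alert_count, incident_count):
    """Average number of alerts fired per incident (0.0 when there were no incidents)."""
    if incident_count == 0:
        return 0.0
    return alert_count / incident_count


def classify_ratio(ratio):
    """Bucket the ratio using the bands from the text."""
    if ratio > 50:
        return "amplifying"
    if ratio > 10:
        return "problem"
    return "healthy"


def actionability(alerts):
    """Share of alerts that needed human action."""
    if not alerts:
        return 0.0
    return sum(1 for a in alerts if a["actioned"]) / len(alerts)


# The database latency example from earlier: one incident, 200 alerts.
ratio = alert_to_incident_ratio(200, 1)
print(ratio, classify_ratio(ratio))

# Hypothetical month of alerts where only 5 out of 100 needed a human.
month = [{"actioned": i in range(5)} for i in range(100)]
print(f"{actionability(month):.0%} actionable")
```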

&lt;h2&gt;
  
  
  The Real Issue
&lt;/h2&gt;

&lt;p&gt;Executives think alert fatigue is a staffing problem. Managers think it's a process problem. Engineers blame monitoring.&lt;/p&gt;

&lt;p&gt;Most of the time it's actually a systems design problem. Every unnecessary dependency, every tightly coupled service, every retry storm, every cascading failure mechanism adds another alert that will fire during the next incident. The monitoring system isn't broken. It's just revealing how tightly woven everything is.&lt;/p&gt;
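&lt;p&gt;A circuit breaker is one common way to keep a retry storm from spreading, and it is the kind of isolation described earlier. The version below is a deliberately small sketch: the class name, thresholds, and half-open behaviour are illustrative choices, not a reference implementation.&lt;/p&gt;

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.reset_after_s > self.clock() - self.opened_at:
                # Still cooling down: fail fast instead of hitting the dependency.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: allow one trial call (half-open).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

&lt;p&gt;While the breaker is open, callers get one fast, predictable error instead of piling timeouts and retries onto a dependency that is already struggling.&lt;/p&gt;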

&lt;h2&gt;
  
  
  Worth Asking
&lt;/h2&gt;

&lt;p&gt;When your team is drowning in alerts, the instinct is to improve runbooks and escalation policies. Resist that. Ask something harder:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why does a single failure become hundreds of signals?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Because each alert is telling you something. And sometimes what it's really telling you isn't about how to respond faster. It's about how the system is built.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>monitoring</category>
      <category>learning</category>
    </item>
    <item>
      <title>Please see this and share your insights on this</title>
      <dc:creator>Mrinal Narang</dc:creator>
      <pubDate>Fri, 12 Jun 2026 15:22:19 +0000</pubDate>
      <link>https://dev.to/mrinal_narang_13a3d00eb37/please-see-this-and-share-your-insights-on-this-315h</link>
      <guid>https://dev.to/mrinal_narang_13a3d00eb37/please-see-this-and-share-your-insights-on-this-315h</guid>
<description>&lt;p&gt;&lt;a href="https://dev.to/mrinal_narang_13a3d00eb37/mongodb-dr-drill-automation-with-terraform-python-jenkins-how-we-made-restores-boring-loj"&gt;MongoDB DR Drill Automation with Terraform, Python &amp;amp; Jenkins — How We Made Restores Boring&lt;/a&gt;&lt;/p&gt;


</description>
    </item>
    <item>
      <title>MongoDB DR Drill Automation with Terraform, Python &amp; Jenkins — How We Made Restores Boring</title>
      <dc:creator>Mrinal Narang</dc:creator>
      <pubDate>Fri, 12 Jun 2026 15:21:39 +0000</pubDate>
      <link>https://dev.to/mrinal_narang_13a3d00eb37/mongodb-dr-drill-automation-with-terraform-python-jenkins-how-we-made-restores-boring-loj</link>
      <guid>https://dev.to/mrinal_narang_13a3d00eb37/mongodb-dr-drill-automation-with-terraform-python-jenkins-how-we-made-restores-boring-loj</guid>
      <description>&lt;h2&gt;
  
  
  Backups Don't Save You. Restores Do.
&lt;/h2&gt;

&lt;p&gt;We ran a MongoDB restore drill last quarter. It failed — not the restore itself, but the confidence. Nobody in the room was sure the data was actually intact. The service came back up, and we all just stared at each other.&lt;/p&gt;

&lt;p&gt;That was the problem. So we fixed it by automating everything.&lt;/p&gt;

&lt;p&gt;One Jenkins job now provisions infra, builds the replica set, restores from dumps, validates data integrity, and stores a full audit trail. Here's exactly how it works.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Goal
&lt;/h2&gt;

&lt;p&gt;Remove every manual, error-prone step from the DR process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identical restore flow across all environments&lt;/li&gt;
&lt;li&gt;Automated replica set setup — no manual &lt;code&gt;rs.initiate()&lt;/code&gt; typos&lt;/li&gt;
&lt;li&gt;Real validation that &lt;em&gt;proves&lt;/em&gt; data is intact, not just assumed&lt;/li&gt;
&lt;li&gt;Full audit trail for post-mortems and compliance reviews&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Pipeline: 5 Stages
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Infrastructure with Terraform
&lt;/h3&gt;

&lt;p&gt;Every drill starts with clean infra. Terraform provisions EC2s, networking, and persistent volumes from scratch — same starting point every time. No leftover state. No "works on my machine" surprises.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_instance"&lt;/span&gt; &lt;span class="s2"&gt;"mongo_node"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;count&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
  &lt;span class="nx"&gt;ami&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mongo_ami&lt;/span&gt;
  &lt;span class="nx"&gt;instance_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.medium"&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"mongo-dr-node-${count.index}"&lt;/span&gt;
    &lt;span class="nx"&gt;Role&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"mongodb-replica"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Replica Set Creation (Python)
&lt;/h3&gt;

&lt;p&gt;Instead of manually running &lt;code&gt;rs.initiate()&lt;/code&gt; and &lt;code&gt;rs.add()&lt;/code&gt; and hoping the timing works, a Python script handles the entire setup — ordering, retries, and confirmation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pymongo&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MongoClient&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;init_replica_set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;primary_host&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;secondary_hosts&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
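    &lt;span class="c1"&gt;# Connect directly: the node is not part of an initiated replica set yet&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mongodb://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;primary_host&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:27017&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;directConnection&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;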
    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rs0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;members&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;primary_host&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;secondary_hosts&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;admin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replSetInitiate&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Wait for PRIMARY election
&lt;/span&gt;    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;admin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;command&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replSetGetStatus&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stateStr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PRIMARY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;members&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Replica set did not elect a PRIMARY in time&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Automating this removes timing issues and misconfiguration. Every replica set comes up the same way.&lt;/p&gt;
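&lt;p&gt;An elected PRIMARY does not by itself mean the secondaries have caught up. A stricter confirmation step could check every member's state before the restore starts. This is a sketch that assumes the three-node set from the Terraform stage. The helper name &lt;code&gt;replica_set_ready&lt;/code&gt; is hypothetical, and it reads only the &lt;code&gt;members&lt;/code&gt; and &lt;code&gt;stateStr&lt;/code&gt; fields already used above.&lt;/p&gt;

```python
READY_STATES = {"PRIMARY", "SECONDARY"}


def replica_set_ready(status, expected_members=3):
    """True when there is exactly one PRIMARY and every expected member is PRIMARY or SECONDARY.

    `status` is the document returned by the replSetGetStatus command.
    """
    states = [m["stateStr"] for m in status.get("members", [])]
    if len(states) != expected_members:
        return False
    return states.count("PRIMARY") == 1 and all(s in READY_STATES for s in states)


# Example status while one node is still doing its initial sync.
still_syncing = {"members": [
    {"stateStr": "PRIMARY"},
    {"stateStr": "SECONDARY"},
    {"stateStr": "STARTUP2"},
]}
print(replica_set_ready(still_syncing))
```

&lt;p&gt;In the polling loop it would be called as &lt;code&gt;replica_set_ready(client.admin.command("replSetGetStatus"))&lt;/code&gt;.&lt;/p&gt;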

&lt;h3&gt;
  
  
  3. Backup &amp;amp; Restore
&lt;/h3&gt;

&lt;p&gt;Backups are normalized into compressed archives. The restore unpacks a dump and applies it to the fresh nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create dump&lt;/span&gt;
mongodump &lt;span class="nt"&gt;--host&lt;/span&gt; &lt;span class="nv"&gt;$SOURCE_HOST&lt;/span&gt; &lt;span class="nt"&gt;--db&lt;/span&gt; &lt;span class="nv"&gt;$DB_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--out&lt;/span&gt; /backup/dump &lt;span class="nt"&gt;--gzip&lt;/span&gt;

&lt;span class="c"&gt;# Restore to DR environment&lt;/span&gt;
mongorestore &lt;span class="nt"&gt;--host&lt;/span&gt; &lt;span class="nv"&gt;$DR_HOST&lt;/span&gt; &lt;span class="nt"&gt;--db&lt;/span&gt; &lt;span class="nv"&gt;$DB_NAME&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  /backup/dump/&lt;span class="nv"&gt;$DB_NAME&lt;/span&gt; &lt;span class="nt"&gt;--gzip&lt;/span&gt; &lt;span class="nt"&gt;--drop&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
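&lt;p&gt;If the pipeline drives the restore from Python rather than a shell step, the same command can be built as an argument list so hostnames and database names never pass through shell quoting. A sketch under assumptions: the function names and the &lt;code&gt;/backup/dump&lt;/code&gt; default are illustrative, and only the flags from the command above are used.&lt;/p&gt;

```python
import subprocess


def build_restore_command(dr_host, db_name, dump_root="/backup/dump"):
    """Build the mongorestore argv with the same flags as the shell command above."""
    return [
        "mongorestore",
        "--host", dr_host,
        "--db", db_name,
        f"{dump_root}/{db_name}",
        "--gzip",
        "--drop",
    ]


def run_restore(dr_host, db_name, runner=subprocess.run):
    """Run the restore; check=True makes a non-zero exit raise CalledProcessError."""
    cmd = build_restore_command(dr_host, db_name)
    completed = runner(cmd, check=True, capture_output=True, text=True)
    return completed.returncode


print(" ".join(build_restore_command("dr-node-0", "orders")))
```

&lt;p&gt;Failing loudly on a non-zero exit code matters here: a restore that half-succeeded should stop the drill before validation runs.&lt;/p&gt;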



&lt;h3&gt;
  
  
  4. Validation &amp;amp; Comparison — The Part Most Teams Skip
&lt;/h3&gt;

&lt;p&gt;This is the step that actually builds confidence. The validation script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checks which collections exist (flags missing collections)&lt;/li&gt;
&lt;li&gt;Compares document counts collection by collection&lt;/li&gt;
&lt;li&gt;Compares indexes between source and restored DB&lt;/li&gt;
&lt;li&gt;Samples &lt;code&gt;_id&lt;/code&gt; values for obvious data mismatches
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_restore&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dr_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source_uri&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="n"&gt;db_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;dr&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dr_uri&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="n"&gt;db_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pass&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;collections&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{}}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_collection_names&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;src_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;count_documents&lt;/span&gt;&lt;span class="p"&gt;({})&lt;/span&gt;
        &lt;span class="n"&gt;dr_count&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;count_documents&lt;/span&gt;&lt;span class="p"&gt;({})&lt;/span&gt;
        &lt;span class="n"&gt;src_idx&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;index_information&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;dr_idx&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;index_information&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src_count&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;dr_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src_idx&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;dr_idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;collections&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;src_count&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;dr_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;src_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dr_count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="n"&gt;dr_count&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index_match&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;src_idx&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;dr_idx&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fail&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The script turns the report's &lt;code&gt;status&lt;/code&gt; into an exit code. Exit code &lt;code&gt;0&lt;/code&gt; = counts and indexes match → Jenkins passes.&lt;br&gt;
Non-zero = mismatch → Jenkins fails the build immediately.&lt;/p&gt;
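&lt;p&gt;The function above returns a report, not an exit code, so some glue has to translate one into the other. A minimal sketch of that glue, assuming hypothetical helper names (&lt;code&gt;exit_code_for&lt;/code&gt;, &lt;code&gt;write_report&lt;/code&gt;, &lt;code&gt;finish&lt;/code&gt;) and a &lt;code&gt;reports/&lt;/code&gt; output directory; the original script may do this differently:&lt;/p&gt;

```python
import json
import sys
import time
from pathlib import Path


def exit_code_for(report):
    """Map a validation report to a process exit code (0 = pass, 1 = fail)."""
    return 0 if report.get("status") == "pass" else 1


def write_report(report, out_dir="reports"):
    """Write the report as timestamped JSON so Jenkins can archive it."""
    path = Path(out_dir)
    path.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    target = path / f"restore-validation-{stamp}.json"
    target.write_text(json.dumps({"generated_at": stamp, **report}, indent=2))
    return target


def finish(report, out_dir="reports"):
    """Persist the report, then exit so the Jenkins sh step passes or fails."""
    write_report(report, out_dir)
    sys.exit(exit_code_for(report))
```

&lt;p&gt;Ending the script with something like &lt;code&gt;finish(validate_restore(...))&lt;/code&gt; would give Jenkins both the JSON file to archive and the exit code to gate on.&lt;/p&gt;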

&lt;p&gt;No more guessing. No more staring at each other in the war room.&lt;/p&gt;
&lt;h3&gt;
  
  
  5. Jenkins Orchestration
&lt;/h3&gt;

&lt;p&gt;Single Jenkins pipeline. Stages run sequentially, each one gated on the previous:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight groovy"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="n"&gt;any&lt;/span&gt;
  &lt;span class="n"&gt;stages&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;stage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Provision Infra'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;sh&lt;/span&gt; &lt;span class="s1"&gt;'terraform init &amp;amp;&amp;amp; terraform apply -auto-approve'&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;stage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Setup Replica Set'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;sh&lt;/span&gt; &lt;span class="s1"&gt;'python3 scripts/init_replica_set.py'&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;stage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Restore MongoDB'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;sh&lt;/span&gt; &lt;span class="s1"&gt;'bash scripts/restore.sh'&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;stage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Validate Restore'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;sh&lt;/span&gt; &lt;span class="s1"&gt;'python3 scripts/validate_restore.py'&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// Archive in post/always so reports are kept even when an earlier stage fails&lt;/span&gt;
  &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;always&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="n"&gt;archiveArtifacts&lt;/span&gt; &lt;span class="nl"&gt;artifacts:&lt;/span&gt; &lt;span class="s1"&gt;'reports/*.json, logs/*.log'&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nl"&gt;allowEmptyArchive:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every run is logged, every report is archived. When auditors ask whether restores work, you show them an archived report from a timestamped run, with document counts and index checks. Not a gut feeling.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Automate infra, not just the restore.&lt;/strong&gt; Terraform gives you a clean slate every drill. Manual infra setup introduces variability that hides real problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validation is not optional.&lt;/strong&gt; A restore that "seems fine" is not the same as a restore that &lt;em&gt;is&lt;/em&gt; fine. Document count mismatches and missing indexes are easy to catch automatically and impossible to catch by eyeballing logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logs equal trust.&lt;/strong&gt; The audit trail is what makes your DR process credible to others — engineers, management, auditors. Without it, you're asking people to take your word for it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minimal input reduces errors.&lt;/strong&gt; We trimmed required inputs to just host + DB name and let scripts infer the rest. Less to type = fewer mistakes under pressure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practice makes permanent.&lt;/strong&gt; Each drill found a small improvement. After ten drills, the process was genuinely fast and boring — which is exactly what you want.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Outcome
&lt;/h2&gt;

&lt;p&gt;We went from a 3-hour manual war room exercise to a single Jenkins job anyone can trigger. The drills are now predictable, repeatable, and quick.&lt;/p&gt;

&lt;p&gt;More importantly — everyone on the team &lt;em&gt;believes&lt;/em&gt; the restores work, because the validation script proves it every single time.&lt;/p&gt;

&lt;p&gt;Boring DR is good DR.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Running MongoDB in production? When did you last drill a full restore? Drop your setup in the comments — curious how teams handle validation.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>mongodb</category>
      <category>devops</category>
      <category>terraform</category>
      <category>jenkins</category>
    </item>
    <item>
      <title>How Kong's Control Plane / Data Plane Split Cut Our Gateway Costs by 34% (And Made It a Security Layer)</title>
      <dc:creator>Mrinal Narang</dc:creator>
      <pubDate>Fri, 12 Jun 2026 15:13:19 +0000</pubDate>
      <link>https://dev.to/mrinal_narang_13a3d00eb37/how-kongs-control-plane-data-plane-split-cut-our-gateway-costs-by-34-and-made-it-a-security-1754</link>
      <guid>https://dev.to/mrinal_narang_13a3d00eb37/how-kongs-control-plane-data-plane-split-cut-our-gateway-costs-by-34-and-made-it-a-security-1754</guid>
      <description>&lt;h2&gt;
  
  
  The Problem With How Most Teams Run Kong
&lt;/h2&gt;

&lt;p&gt;If you set up Kong the default way, everything lives together — routing, policy enforcement, plugin execution, live traffic handling. One deployment doing all the things.&lt;/p&gt;

&lt;p&gt;It works. Until it doesn't.&lt;/p&gt;

&lt;p&gt;When traffic spikes, you scale up. But you're scaling the control plane too, which barely does anything at runtime. You're paying compute for config management that gets touched only when something changes — not on every request.&lt;/p&gt;

&lt;p&gt;That was us. Scaling more than we needed to, paying for it, and not realizing why.&lt;/p&gt;




&lt;h2&gt;
  
  
  Splitting Control Plane from Data Plane
&lt;/h2&gt;

&lt;p&gt;The data plane is hot. It handles every live request, every millisecond, 24/7. It needs to be fast, lean, and close to your services.&lt;/p&gt;

&lt;p&gt;The control plane is cold. It pushes config — route definitions, plugin settings, policy changes. It fires when something changes, then sits quiet.&lt;/p&gt;

&lt;p&gt;When you separate them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data plane scales with your actual traffic&lt;/li&gt;
&lt;li&gt;Control plane runs small and cheap, sized for config ops not request volume&lt;/li&gt;
&lt;li&gt;You stop paying for compute you're not using&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That architectural change alone dropped our gateway infra cost by &lt;strong&gt;34%&lt;/strong&gt;. No feature removal. No degraded performance. Just stop running one thing at the scale of another.&lt;/p&gt;




&lt;h2&gt;
  
  
  Then We Added Plugins — And Kong Became Something Else
&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. Once the infra is clean, you can actually think about what Kong &lt;em&gt;should&lt;/em&gt; be doing for your stack.&lt;/p&gt;

&lt;h3&gt;
  
  
  JWT Validation at the Gateway
&lt;/h3&gt;

&lt;p&gt;Every request carries a token. Kong verifies it before the request gets anywhere near a service. No valid token, request dies at the edge.&lt;/p&gt;

&lt;p&gt;Your services stop writing token-validation logic. No more 12 slightly different JWT implementations across 12 services. One place, one standard, enforced consistently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;plugins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;jwt&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;secret_is_base64&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;claims_to_verify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;exp&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
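&lt;p&gt;To make the check concrete, here is a rough sketch of what validating an HS256 token with an &lt;code&gt;exp&lt;/code&gt; claim involves. It uses only the Python standard library. The function names are illustrative, and this is not Kong's implementation, only the same idea in miniature:&lt;/p&gt;

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_decode(part):
    # JWT segments are base64url without padding; restore it before decoding.
    return base64.urlsafe_b64decode(part + "=" * (-len(part) % 4))


def _b64url_encode(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def sign_hs256(claims, secret):
    """Build a demo token; in practice the issuer does this, not the gateway."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def verify_hs256(token, secret, now=None):
    """Return the claims if signature and exp are valid, otherwise None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
    except ValueError:
        return None  # malformed token
    if not isinstance(header, dict) or not isinstance(claims, dict):
        return None
    if header.get("alg") != "HS256":
        return None  # refuse unexpected algorithms, including "none"
    signed = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(secret, signed, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        return None  # signature does not match the shared secret
    exp = claims.get("exp")
    current = time.time() if now is None else now
    if not isinstance(exp, (int, float)) or current >= exp:
        return None  # missing or expired exp claim
    return claims
```

&lt;p&gt;Every service that does this itself has to get each of those branches right. Doing it once at the gateway is the point.&lt;/p&gt;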



&lt;h3&gt;
  
  
  OAuth 2.0 for Third-Party Integrations
&lt;/h3&gt;

&lt;p&gt;Handled at the gateway, not scattered across services. External partners authenticate once at the edge. Your internal services never see unauthenticated traffic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rate Limiting Per Consumer, Not Just Per Route
&lt;/h3&gt;

&lt;p&gt;This is the one most teams miss. Route-level rate limiting is blunt. Consumer-level rate limiting is precise.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;plugins&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rate-limiting&lt;/span&gt;
  &lt;span class="na"&gt;consumer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;free-tier&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;rate-limiting&lt;/span&gt;
  &lt;span class="na"&gt;consumer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;enterprise&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;minute&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



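&lt;p&gt;Same plugin, same gateway, different policy per consumer. Kong identifies the consumer from the credential on the request (here, the JWT), then applies that consumer's limit. Free tier gets 100 req/min. Enterprise gets 10,000. Zero application code involved.&lt;/p&gt;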
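&lt;p&gt;For intuition, this is roughly what a limit keyed by consumer looks like as a fixed one-minute window counter. It is a simplified sketch, not Kong's code; the class name is made up, and a real limiter would also expire old windows and share counters across gateway nodes:&lt;/p&gt;

```python
import time


class FixedWindowLimiter:
    """Per-consumer request counter over fixed one-minute windows."""

    def __init__(self, limits_per_minute):
        self.limits = limits_per_minute  # e.g. {"free-tier": 100}
        self.counters = {}               # (consumer, window index) -> count

    def allow(self, consumer, now=None):
        limit = self.limits.get(consumer)
        if limit is None:
            return False  # unknown consumer: reject in this sketch
        current = time.time() if now is None else now
        key = (consumer, int(current // 60))
        used = self.counters.get(key, 0)
        if used >= limit:
            return False  # budget for this window is spent
        self.counters[key] = used + 1
        return True


limiter = FixedWindowLimiter({"free-tier": 100, "enterprise": 10000})
```

&lt;p&gt;The route never appears in the key. Two consumers hitting the same endpoint draw from separate budgets.&lt;/p&gt;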

&lt;h3&gt;
  
  
  Request Transformation
&lt;/h3&gt;

&lt;p&gt;Strip headers you don't want passing through to services. Inject headers your services expect. Normalize payloads from external partners sending data in formats your team didn't design for — all before the request touches your backend.&lt;/p&gt;
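&lt;p&gt;The header part of that boils down to a small pure function. A sketch with hypothetical header names (&lt;code&gt;X-Debug&lt;/code&gt;, &lt;code&gt;X-Request-Source&lt;/code&gt;) chosen only for illustration:&lt;/p&gt;

```python
def transform_headers(headers, remove=(), add=None):
    """Return a new header dict with `remove` stripped and `add` injected."""
    blocked = {name.lower() for name in remove}  # header names are case-insensitive
    result = {k: v for k, v in headers.items() if k.lower() not in blocked}
    result.update(add or {})
    return result


incoming = {"Authorization": "Bearer abc", "X-Debug": "1", "Accept": "application/json"}
outgoing = transform_headers(
    incoming,
    remove=["x-debug"],                    # strip what services should not see
    add={"X-Request-Source": "gateway"},   # inject what services expect
)
```

&lt;p&gt;The backend only ever sees &lt;code&gt;outgoing&lt;/code&gt;, so it never has to defend against the stripped headers.&lt;/p&gt;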

&lt;h3&gt;
  
  
  IP Whitelisting on Internal Routes
&lt;/h3&gt;

&lt;p&gt;Certain paths are accessible only from known sources, using Kong's &lt;code&gt;ip-restriction&lt;/code&gt; plugin. One config block, enforced at the gateway for every service behind it.&lt;/p&gt;
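&lt;p&gt;The check itself is a CIDR membership test. A minimal sketch using Python's &lt;code&gt;ipaddress&lt;/code&gt; module, with a made-up function name and example ranges:&lt;/p&gt;

```python
import ipaddress


def is_allowed(client_ip, allowed_cidrs):
    """True if client_ip falls inside any of the allowed networks."""
    try:
        address = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # unparseable address: deny
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


internal_ranges = ["10.0.0.0/8", "192.168.0.0/16"]  # example private ranges
```

&lt;p&gt;Anything that fails the test is rejected at the edge, before a service is involved.&lt;/p&gt;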




&lt;h2&gt;
  
  
  What This Actually Changed
&lt;/h2&gt;

&lt;p&gt;Before: auth logic lived in every service. Every team implemented it differently. Every security audit found inconsistencies. Every new service started from scratch building things that had already been built six times.&lt;/p&gt;

&lt;p&gt;After: the gateway owns identity, rate policy, request shape, and access control. Services own business logic. That boundary is clean and it stays clean.&lt;/p&gt;

&lt;p&gt;When we did a security audit post-migration, the findings dropped significantly. Not because we wrote better application code — we hadn't touched it. Because we moved the security surface to one place and made it consistent.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture That Came Out of This
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;External Traffic
      │
      ▼
[Kong Data Plane]  ◄──── [Kong Control Plane] (small, separate, cheap)
      │                         │
  JWT auth                   Config push
  Rate limiting               Plugin management
  Request transform           Route definitions
  IP whitelist
      │
      ▼
[Your Services]  ←── Business logic only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data plane is the only thing in the hot path. The control plane is a config server. Your services are finally just services.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Split control plane from data plane → stop scaling what doesn't need to scale&lt;/li&gt;
&lt;li&gt;JWT at the gateway → services stop re-implementing token validation&lt;/li&gt;
&lt;li&gt;Per-consumer rate limiting → fine-grained control without application changes&lt;/li&gt;
&lt;li&gt;Request transformation → normalize at the edge, not inside your code&lt;/li&gt;
&lt;li&gt;One security surface → consistent, auditable, maintainable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're running Kong as a glorified reverse proxy, you're leaving most of its value on the table.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>apigateway</category>
      <category>kubernetes</category>
      <category>security</category>
    </item>
  </channel>
</rss>
