DEV Community: Mrinal Narang

Two Kubernetes Decisions Nobody Writes About Honestly

Mrinal Narang — Tue, 30 Jun 2026 03:14:00 +0000

1. Node Group Sizing

Fewer large nodes vs. many small ones. Textbooks don't cover this.

We ran 10 nodes, 32 CPU each. Seemed efficient.

Problem: One node dies, 32 CPU worth of workloads (10% of the cluster) need to reschedule at once. Cluster autoscaler couldn't handle it. Pods sat pending for 10 minutes.

We switched to 20 nodes, 16 CPU each. Same total capacity.

One node dies now? 16 CPU to reschedule. Autoscaler catches it in 90 seconds. Scheduling is tighter, but failures are isolated.

Cost stayed the same. Blast radius halved.

Why nobody writes about this: The tradeoff isn't obvious. Large nodes are "more efficient." Smaller nodes are "more resilient." Both are true. It depends on whether you'd rather have one big problem or many small ones.

We picked smaller nodes because a node failure was our actual failure mode. Not resource efficiency.
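
The arithmetic behind that choice is easy to sanity-check. A minimal sketch, assuming uniform node groups; `blast_radius` is a hypothetical helper, and the node counts are the ones from this post:

```python
def blast_radius(node_count: int, cpu_per_node: int) -> dict:
    """Capacity lost when a single node in a uniform node group fails."""
    total_cpu = node_count * cpu_per_node
    return {
        "total_cpu": total_cpu,
        "cpu_to_reschedule": cpu_per_node,
        "share_of_cluster": cpu_per_node / total_cpu,
    }


large = blast_radius(node_count=10, cpu_per_node=32)
small = blast_radius(node_count=20, cpu_per_node=16)

for name, result in (("10 x 32 CPU", large), ("20 x 16 CPU", small)):
    print(f"{name}: lose {result['cpu_to_reschedule']} CPU "
          f"({result['share_of_cluster']:.0%} of {result['total_cpu']} CPU)")
```

Same total capacity in both layouts; only the share lost per node failure changes.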

2. Readiness vs Liveness Probes

Misconfigure these and your cluster looks like it's melting.

Readiness probe: "Can this pod take traffic?"

Liveness probe: "Is this pod alive? Restart it if not."

One team set readiness = liveness. Same probe checked both.

Probe logic: "If I can reach the database, I'm ready."

Database gets slow. Probe fails. Pod becomes "not ready." Load balancer removes it from rotation (correct).

But liveness also failed. Kubernetes killed the pod and restarted it.

New pod starts. Probe fails immediately (database still slow). Gets killed. Restarted.

This cascaded across 30 pods. 30 restarts/minute. New pods spent 100% of time restarting.

Looks like an application bug for the first 20 minutes. Actually a probe configuration bug.

Fix: Readiness checks "can I take traffic right now?" Liveness checks "am I fundamentally broken?" Use separate probes.

Readiness: Check database connection with short timeout. Fail if slow. This is reasonable - don't send traffic to slow pods.

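Liveness: Check only that the process responds at all, with no dependency checks. Give it a much more tolerant threshold (longer timeout, more consecutive failures) so Kubernetes only kills a pod that is truly hung.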

Same database slowness? Pods become unready. Traffic reroutes. Cluster stabilizes. No restart loop.
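
On the application side, the split can look roughly like this. It is a sketch under assumptions, not the team's actual service: the `/healthz` and `/readyz` paths, the 0.5s timeout, and `database_reachable` are all hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler

_pool = ThreadPoolExecutor(max_workers=2)


def database_reachable() -> bool:
    """Hypothetical dependency check; replace with a cheap real query."""
    return True


def liveness_status() -> int:
    # Liveness never touches dependencies: if this code runs, the process is alive.
    return 200


def readiness_status(check=database_reachable, timeout_s: float = 0.5) -> int:
    # Readiness fails fast when the dependency is slow, unreachable, or erroring.
    future = _pool.submit(check)
    try:
        return 200 if future.result(timeout=timeout_s) else 503
    except Exception:  # includes the timeout raised by future.result()
        return 503


class HealthHandler(BaseHTTPRequestHandler):
    """Serves the two probe endpoints on separate paths."""

    def do_GET(self):
        if self.path == "/healthz":
            code = liveness_status()
        elif self.path == "/readyz":
            code = readiness_status()
        else:
            code = 404
        self.send_response(code)
        self.end_headers()
```

Point the liveness probe at the dependency-free endpoint and the readiness probe at the dependency-aware one. A slow database then only takes pods out of rotation.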

The Connection

Both decisions have hidden failure modes that surface under stress.

Node sizing looks good until a node fails and you realize your blast radius is too large.

Probe configuration looks good until the database gets slow and your whole cluster starts thrashing.

Neither is "wrong." Both have tradeoffs. The difference is understanding the failure mode you're actually optimizing for.

Large nodes are efficient until they aren't.

Readiness = Liveness is simple until cascading restarts make it obvious.

If you're running EKS with 8+ large nodes, consider downsizing and multiplying. If you've never hit a restart loop from probe misconfiguration, you will eventually.

When you do, check if readiness and liveness are the same. Usually they are.

#Kubernetes #EKS #DevOps #SRE #IncidentResponse

Blameless Postmortems in Practice

Mrinal Narang — Mon, 29 Jun 2026 03:07:00 +0000

Most teams claim they do blameless postmortems.

Then the incident happens.

"Jane didn't validate the input."

"The on-call missed the alert."

"We should have caught this in code review."

That's blame. It's just dressed up in process language.

The Gap

Blameless postmortems aren't about ignoring human error. They're about understanding why a reasonable person made a decision that, in hindsight, was wrong.

The question isn't: "What did Jane do wrong?"

It's: "What made Jane's action seem reasonable at the time?"

If you can't answer the second question, your postmortem isn't blameless. It's just performative.

What Actually Happens

Blameless postmortem (real):

"The deployment happened without running tests. Why?

The test environment was down for maintenance.
Nobody documented which environment Jane should use instead.
It was 11 PM on a Friday.
Jane has deployed 200 times without incident.
The process allowed skipping tests if 'urgent.'

So we added automated test gates that can't be bypassed. We documented the backup environment. We made urgent deployments require two people."

Blamed postmortem (disguised):

"The deployment happened without running tests.

Root cause: Insufficient process discipline.

Action item: Remind team to follow procedures."

One actually changes behavior. One just documents that someone messed up.

The Test

Read your last three postmortems.

Count how many times you see:

"Person X should have..."
"We should have caught..."
"Insufficient discipline..."
"Better communication would have..."

If the focus is on what people should do differently, you're not doing blameless postmortems. You're doing blame with better language.

Real blameless postmortems focus on:

What system allowed this to happen?
What information was missing?
What would have made the better decision obvious?
What tool could have caught this?

The Shift That Matters

Blame mindset: "How do we stop people from doing this?"

Blameless mindset: "How do we build systems where the wrong decision is harder than the right one?"

Example:

Blame: "The engineer deployed without approval."

Action: "Require manual approvals before deployment."

Result: Engineers find workarounds. Deployments slow. Nothing changes.

Blameless: "The engineer deployed without approval. Why did that seem reasonable?"

Answer: "The approval process was taking 2 hours, and the customer issue was urgent. The engineer bypassed it."

Action: "Implement auto-approval for critical hotfixes if all tests pass."

Result: Urgent deployments don't require workarounds. Actual behavior changes.

The Questions That Reveal Blame

"Why did the on-call miss the alert?"

vs.

"Why didn't the on-call see the alert? Was the alert buried in noise? Was the alert configured wrong? Was the on-call context-switching too much?"

First question assumes blame. Second question discovers systems.

"The engineer didn't validate input."

vs.

"Why wasn't input validation enforced at the framework level? Why didn't the linter catch this? Why was this pattern possible?"

The first framing is about the engineer. The second is about the system.

What Actually Works

Document the decision-making context. Not judgment.

"The engineer believed the data was validated upstream" is context.

"The engineer was careless" is judgment.

Ask: "If this exact situation happened tomorrow, would the same decision seem reasonable to a competent person?"

If yes, it's a system problem. Fix the system.

If no, you've found something else.

The Honest Part

Real blameless postmortems are harder than blamed ones.

It's easier to say "Person did bad thing" than to trace the systems that made the bad thing seem reasonable.

It requires admitting that your process enabled the failure.

It requires changing things instead of just documenting them.

But it's the only approach that actually changes behavior.

Teams that claim "blameless" but still use postmortems as accountability theater don't fix anything. They just have better documentation of blame.

Teams that actually ask "why would a reasonable person make this decision?" build systems where the failures stop happening.

Check your last postmortem. What were the action items?

If they're mostly about "team discipline" or "better communication," you're doing blame with better language.

If they're about systems, tools, and removing friction from the right path, you're actually being blameless.

#DevOps #IncidentResponse #Postmortem #Blameless #TeamCulture #SRE

Scaling Cooldown Tuning: Stop Your Autoscaler From Thrashing

Mrinal Narang — Mon, 29 Jun 2026 02:57:00 +0000

Your HPA is flapping.

Pods spin up. Traffic dips. Pods spin down. Traffic returns. Pods spin up again. All within 90 seconds.

This costs money and stability. Every scale event creates pod churn. New pods need to warm up. Connections restart. Metrics refresh.

The fix isn't complicated. It's tuning cooldown periods.

What Flapping Looks Like

Before tuning:

9:15 AM: CPU hits 75%. HPA scales 3→5 pods.
9:16 AM: Traffic normalizes. CPU drops to 60%.
9:17 AM: HPA scales 5→3 pods (the default scaleDown stabilization window is 300s, but our configuration wasn't honoring it).
9:18 AM: Batch request comes in. CPU jumps to 80%.
9:19 AM: HPA scales 3→5 pods again.

Every 60-90 seconds. Constantly. Pod logs show connection resets every minute.

Billing spike? $200/day in unnecessary compute because pods kept restarting.

The Tuning

We changed from defaults:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300    # wait 5 min before scaling down
    policies:
    - type: Percent
      value: 50
      periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 0      # scale up immediately
    policies:
    - type: Percent
      value: 100
      periodSeconds: 60

Why these values?

scaleDown stabilizationWindow: 300s (5 min). If CPU drops below threshold, wait 5 minutes before actually scaling down. Most traffic spikes last longer than 90 seconds. This prevents reacting to temporary dips. One team tried 60s, still flapping. 300s worked.

scaleDown percent: 50. Remove at most about half the pods per 60-second period instead of dropping straight to the target. Every pod you remove is a bet that you won't need it again in the next few minutes. Stepping down (for example 5→3) is safer than removing everything above the minimum in one move.

scaleUp stabilizationWindow: 0. When CPU hits 75%, scale immediately. You have customers waiting. Slow scale-up means slow response time.

scaleUp percent: 100. Double the pod count if needed. If you're at 3 pods and hitting limits, jump to 6. Better to overprovision briefly than make customers wait.
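
Why does the scale-down window damp flapping? For scale-down, the HPA acts on the highest replica recommendation it computed during the window, not the latest one. A simplified sketch of that rule; the function name and sample history are illustrative, not the controller's actual code:

```python
def stabilized_scale_down(history, now_s, window_s=300):
    """Return the replica count the HPA would settle on at now_s.

    history is a list of (timestamp_s, recommended_replicas) tuples and is
    assumed to include a recommendation taken at now_s.
    """
    recent = [
        replicas
        for ts, replicas in history
        if 0 <= now_s - ts <= window_s
    ]
    # The highest recent recommendation wins, so a short dip cannot shrink the fleet.
    return max(recent)


# One recommendation per minute: a spike at t=0, then a dip.
history = [(0, 5), (60, 3), (120, 3), (180, 3), (240, 3), (300, 3), (360, 3)]

print(stabilized_scale_down(history, now_s=120))               # still inside the window
print(stabilized_scale_down(history, now_s=360))               # spike has aged out
print(stabilized_scale_down(history, now_s=120, window_s=60))  # short window reacts sooner
```

With a 300s window the spike at t=0 keeps the replica count up until it ages out. With a 60s window the same dip triggers a scale-down almost immediately.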

After Tuning

Same 9:15 AM scenario:

9:15 AM: CPU hits 75%. HPA scales 3→5 pods immediately.
9:20 AM: Traffic stabilizes. System waits (stabilization window).
9:25 AM: CPU still below 60%. HPA scales 5→3 pods.
9:26 AM: No more thrashing.

Pod restart rate dropped 95%. Load balancer connection resets went from 60/min to 2/min.

Monthly compute cost dropped $1,400 (was $8,500/month due to churn, now $7,100).

The Principle

Scale up fast, scale down slow.

Customers need capacity now. They don't care if you have extra pods for 5 minutes. They do care if you're constantly churning them.

Stabilization windows let temporary spikes and dips pass without action. Percent-based scaling lets you adjust gradually instead of binary yes/no decisions.

One team still uses defaults. They have pod churn every 90 seconds. Another adjusted to these values and saw pod churn once per day, only when actual traffic patterns genuinely changed.

Your HPA is probably thrashing. Check your stabilization windows. If you see pod restart spikes that correlate with CPU threshold crossings, you've found it.

Set the scaleDown stabilization window to 300s. Set the scaleUp stabilization window to 0. Adjust percents based on your app. Test. Most teams see 70-80% reduction in unnecessary scaling events.

#Kubernetes #Autoscaling #HPA #CostOptimization #DevOps

Dependency Mapping and Hidden Failure Modes

Mrinal Narang — Sun, 28 Jun 2026 02:57:00 +0000

You've got your architecture diagram.

It looks good. Services connected with clear lines. Data flows. Integration points.

Solid design.

Then production goes down.

And the outage spreads through a dependency nobody drew on that diagram.

The Reality

Most outages don't follow the architecture diagram. They follow the actual code.

You have a service that calls Service A. Service A calls Service B synchronously. Service B reads from a cache. That cache is backed by Service C. Service C has an undocumented polling relationship with Service D.

Nowhere on your diagram.

But when D fails? The entire stack goes down. In order: C gets slow, B times out, A gets backed up, your service drowns in connection timeouts.

Customers notice before your alerts fire.
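
Once the real edges are written down, "what goes down when D fails?" is a graph traversal. A minimal sketch using the chain described above; the service names and the `impacted_by` helper are illustrative:

```python
from collections import defaultdict, deque

# Edges read "caller depends on callee".
DEPENDS_ON = {
    "your-service": ["service-a"],
    "service-a": ["service-b"],
    "service-b": ["cache"],
    "cache": ["service-c"],
    "service-c": ["service-d"],  # the undocumented polling relationship
}


def impacted_by(failed, depends_on):
    """Return every service that transitively depends on `failed`."""
    # Invert the edges so we can walk from a failure to its callers.
    dependents = defaultdict(set)
    for caller, callees in depends_on.items():
        for callee in callees:
            dependents[callee].add(caller)

    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents[current]:
            if caller not in impacted:
                impacted.add(caller)
                queue.append(caller)
    return impacted


print(sorted(impacted_by("service-d", DEPENDS_ON)))
```

The traversal is only as good as the edge list. The last edge is the one that was never on the diagram.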

What Gets Missed

Implicit dependencies. Service A doesn't explicitly call Service B. But A reads from a table that B populates. If B stops writing, A fails silently. Nobody knew they were coupled.

Transitive failures. You know you depend on the database. What you don't know is that the database client library maintains a background connection pool that hits an internal service. That service goes down. Your database works fine. Your application hangs.

Async failures hidden as success. A request succeeds, returns 200. But a background job that's supposed to process the data never fires. The dependency broke, you didn't notice for hours.

Shared infrastructure you forgot about. Two services running on the same Kubernetes node. One burns CPU, the other starves. You didn't plan for them to interfere. They do.

Third-party API cascades. Your service integrates with an API that calls another API internally. When that internal API is slow, your service times out. You didn't know about the dependency. The API provider didn't document it.

How You Actually Discover Dependencies

You don't discover them during planning sessions. You discover them during incidents.

2 AM. Everything is burning. You start tracing requests. You find a call you didn't know existed. You look at the code. "Oh. Yeah. Service X calls Service Y as a fire-and-forget."

You knew about Service X. You knew about Service Y. You didn't know they were connected.

By the time you're discovering this, customers have been down for 40 minutes.

The Tools Help But Don't Solve It

Network traffic analysis shows connections. Distributed tracing reveals call chains. APM tools map service interactions.

These help. But they only show you what's currently happening. If a dependency is dormant, it's invisible. If a failure path is rare, you won't see it until it happens.

A service that calls another service only during payment processing won't show up in your dependency map until someone tries to make a payment during an outage.

What Actually Works

Run incidents. Deliberately. Gamedays and chaos engineering aren't about proving resilience. They're about discovering unknown dependencies before they become production incidents. Shut down a service you think is non-critical and watch what breaks.

Trace the data, not the diagram. Follow what happens to a customer request. Where does it go? What systems read the results? What systems depend on side effects? Write it down. That's your actual architecture.

Check what you're not monitoring. If you're not alerting on a dependency, you probably don't know about it. Set a timer. Pick a random service. Ask: what would break if this disappeared right now? If you don't know, you've found a hidden dependency.

Document after incidents. The postmortem is the best time to update your architecture diagram. You now know something that wasn't documented before. Write it down so the next person doesn't learn it during an outage.

Assume cache failures. Every cache hit is a hidden dependency. Every background job is a failure mode. Every async operation is a silent failure waiting to happen. Don't assume these are optional.

The Honest Answer

You can't map every dependency. Some are emergent properties of how systems interact. Some only become relevant during specific failure scenarios.

But you can discover them faster.

Run incidents before production does. Trace requests end-to-end. Alert on the things you're not expecting to fail. When something breaks, update your diagram.

Most outages spread through things you didn't know existed. The goal isn't to prevent that.

It's to find out what you don't know before the customers do.

#DevOps #SRE #Architecture #IncidentResponse #Systems #Observability

Kubernetes Cost Optimization: Stop Buying Compute You Never Needed

Mrinal Narang — Thu, 25 Jun 2026 11:08:00 +0000

Ask teams how they're reducing Kubernetes costs and you hear: Spot Instances, autoscaling, Reserved Instances, Graviton.

All worthwhile.

But here's what I've found actually works:

Stop paying for resources workloads never use.

The Real Problem

Most clusters are carrying years of operational assumptions.

"Add some buffer." "Double the memory just in case." "Optimize later."

Months later, those assumptions become production reality.

And production reality becomes a monthly invoice.

What Actually Happens

Compare what pods request versus what they use:

Services requesting 2 CPU consuming 200m
Apps requesting 4 GB RAM consuming 800 MB
Workloads requesting 8 GB using less than 1.5 GB

Kubernetes reserves node capacity for resources that are never used. Extra nodes get provisioned. Not because applications need them. Because requests claim they do.

One team reduced their monthly bill by $40,000 just by bringing pod requests in line with actual usage.

No new technology. No architecture changes. Just honesty.
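
The comparison itself is simple once you have usage numbers. A rough sketch using the figures above; the workload names and the 30% headroom are assumptions to tune, not recommendations:

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    requested: float  # CPU in millicores or memory in MB
    used: float       # observed peak usage, same unit as requested


def utilization(w: Workload) -> float:
    """Fraction of the request that is actually used."""
    return w.used / w.requested


def suggested_request(w: Workload, headroom: float = 1.3) -> float:
    """Observed usage plus headroom, never above the current request."""
    return min(w.requested, w.used * headroom)


workloads = [
    Workload("service-cpu", requested=2000, used=200),     # 2 CPU vs 200m
    Workload("app-memory", requested=4096, used=800),      # 4 GB vs 800 MB
    Workload("batch-memory", requested=8192, used=1536),   # 8 GB vs under 1.5 GB
]

for w in workloads:
    print(f"{w.name}: {utilization(w):.0%} of request used, "
          f"request {w.requested:g} -> {suggested_request(w):g}")
```

In practice the `used` values would come from Prometheus or Metrics Server over a representative period, not a single sample.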

The Other Hidden Savings

Every cluster has workloads nobody needs. Pods that haven't processed meaningful traffic in months. Legacy integrations kept alive "just in case."

Reporting jobs that run 24/7 but only need 8 hours. Data processing jobs running overnight despite having no users.

Scheduling workloads to match actual demand? Teams save 30-40% on cluster costs.

What Doesn't Help

Autoscaling doesn't fix bad sizing. If pods start oversized, autoscaling just scales the oversized pods. Costs scale with the oversized assumptions.

Resource limits set during panic rarely get revisited. The emergency passes. The oversized limits stay. Years later, you're still paying for a worst-case scenario.

What Actually Works

Use Grafana, Prometheus, Metrics Server, Kubecost.

Compare requests versus usage. Check pod activity patterns. Review scaling behavior. Look at which services consume capacity but deliver little value.

The data usually tells a very different story than assumptions.

The Simple Question

If every workload had to justify its resource requests today, how many would survive unchanged?

That's usually where the real savings begin.

#Kubernetes #CostOptimization #DevOps #CloudEngineering #AWS

Dashboard Design for Incident Response

Mrinal Narang — Wed, 24 Jun 2026 11:03:00 +0000

Most dashboards answer one question: Is everything okay?

During an incident, nobody's asking that.

The real question: What broke, where, and what changed?

Most dashboards fail at incidents because they were built for monitoring, not troubleshooting.

The Problem

A typical dashboard shows CPU, memory, disk, network, requests, uptime.

Useful for routine checks.

During an outage? Just noise.

You're not looking for reassurance. You're looking for evidence.

Two Different Jobs

Most teams put everything on one dashboard. That's a compromise that doesn't work for either job.

Monitoring dashboard: Is the platform healthy? SLAs being met? Resources used correctly?

Incident dashboard: What failed? When? What changed? Where do I look next?

Same tools, different purposes.

What Works During an Outage

Error rate front and center. 5XX errors, exceptions, failed transactions. Failures tell the story faster than CPU metrics.

Timeline on the graph. Mark deployments, infrastructure changes, scaling events. Most incidents start right after something changed. Make this visible in one second.

Dependency health. A healthy app talking to a dead database is not healthy. Dependencies often point to root cause faster than app metrics.

Golden signals. Latency, traffic, errors, saturation. These beat hundreds of infrastructure metrics.

Logs visible. Top exceptions, error spikes, failed endpoints. Reduce tab-switching during incidents.

Service map. Which services depend on the failing one? Visual dependency maps answer this instantly.

Alert state. Which alerts fired? Which started first? First alert usually beats alert #100 for root cause.

The Test

For every panel: How does this help me resolve the incident faster?

If the answer isn't obvious, remove it.

Example

EKS outage. Don't show cluster CPU and memory.

Show:

Failed requests by service
Pod restarts
Readiness failures
Recent deployments
HPA scaling events
Dependency latency
Top exceptions
Queue backlogs

One tells you the cluster exists. The other helps you fix it.

The Point

Monitoring dashboards tell you something broke.

Incident dashboards help you figure out why.

During an outage, only the second one matters.

#DevOps #SRE #Monitoring #Kubernetes #IncidentResponse #Dashboard

Blackbox Monitoring vs Internal Metrics - The Gap Between "Healthy" and "Working"

Mrinal Narang — Sun, 21 Jun 2026 11:00:00 +0000

You've probably had this incident. Dashboards are all green. CPU is fine. Memory looks good. Pods aren't restarting. Databases are healthy. But customers can't log in, or payments won't process, or nothing's loading.

You check Prometheus. Nothing's firing. Everything says "we're fine."

Except you're not fine.

A healthy system is not the same as a working system.

The Blind Spot

Most monitoring setups measure what's happening inside the infrastructure.

CPU utilization. Memory consumption. Disk usage. Network throughput. Pod restarts. Request rates. Error counts.

These metrics matter. But they answer one question: How are our components behaving?

Customers are asking something different: Can I complete my task?

The gap between those two questions is where incidents hide.

Internal Metrics Show You The Engine

Think of a car dashboard showing engine temperature normal, fuel level normal, oil pressure normal, battery healthy.

Everything looks fine.

But the steering wheel is disconnected.

That's what a lot of monitoring does. We measure component health while assuming the customer journey works. Usually it does. Sometimes it doesn't.

The Scenarios You Learn From

Most teams adopt synthetic monitoring after a painful incident. The postmortem reads the same way every time:

"All services were healthy."

"Kubernetes showed no issues."

"Database latency was normal."

"But customers couldn't log in."

Or:

"But payments weren't processing."

Or:

"But they couldn't upload files."

The issue wasn't invisible. You just weren't measuring it.

What Gets Missed

Your API returns HTTP 200. Your authentication service is running. Your database is healthy. But the token validation fails because a certificate expired. Green dashboards. Users stuck.

Or a downstream dependency fails silently. Metrics show low latency, healthy containers, no restarts. Customers get incomplete results.

Or a DNS misconfiguration breaks resolution. Everything internal looks normal. Users see downtime.

Or a JavaScript bug on the frontend breaks the checkout flow. Your backend is fine. Your infrastructure is fine. Users can't complete transactions.

Blackbox Monitoring Actually Tests This

Blackbox monitoring doesn't care about implementation details. It behaves like a customer.

Instead of asking "Is the service running?" it asks "Can the user successfully log in? Make a payment? Upload a file? Finish a transaction?"

If the infrastructure is healthy but blackbox monitoring fails, you've found your incident.
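
A blackbox check is a scripted customer journey that reports the first step that fails. A minimal sketch; the step names are hypothetical, and the lambdas stand in for real calls against your public endpoints:

```python
import time
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_journey(steps: List[Step]) -> dict:
    """Run customer-journey steps in order and stop at the first failure."""
    started = time.monotonic()
    for name, action in steps:
        try:
            ok = action()
        except Exception:
            ok = False  # an exception is a failed step, not a crashed probe
        if not ok:
            return {"success": False, "failed_step": name,
                    "duration_s": time.monotonic() - started}
    return {"success": True, "failed_step": None,
            "duration_s": time.monotonic() - started}


# Placeholder steps: real ones would drive the public login and checkout URLs.
journey = [
    ("login", lambda: True),
    ("add_to_cart", lambda: True),
    ("checkout", lambda: False),  # simulate the broken flow
]

result = run_journey(journey)
print(result["success"], result["failed_step"])
```

Alert on `success` and page with `failed_step`. That tells the on-call which customer action is broken before they open a single infrastructure dashboard.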

Which Alert Matters More

CPU utilization exceeded 85%.

vs.

Customers cannot complete checkout.

The second one, obviously. Because customers don't buy CPU.

The whole point of observability isn't to monitor infrastructure. It's to protect business functions.

Use Both

This isn't a choice. Internal metrics and blackbox monitoring solve different problems.

Internal metrics help you understand why something failed. Which component is degraded. Where the bottleneck is. What engineers should investigate.

Blackbox monitoring tells you whether anyone cares yet. Are customers impacted? Can critical workflows succeed? Is the platform delivering value?

One explains the story. The other tells you if the story matters.

Real Example

Your streaming platform goes down.

Internally:

Kubernetes healthy
RabbitMQ healthy
CPU normal
Memory normal
Databases healthy

Blackbox monitoring:

Video playback success rate: 0%

Which alert wakes someone up? The playback failure. That's the closest thing to what your users actually experience.

The Danger

The most damaging outages happen when internal monitoring and customer experience tell different stories.

If you only measure what's happening inside your platform, you're seeing half the picture. Your pods are healthy. Your databases are fine. Your services are running.

Your users just can't do anything with them.

Alert Fatigue Is an Architecture Problem, Not a Process Problem

Mrinal Narang — Sat, 20 Jun 2026 10:55:17 +0000

Every operations team gets the same advice: improve your runbooks, create better escalation policies, train engineers on incident response, tune alert thresholds. Some of it sticks. Most of it doesn't actually fix the problem.

When 200 alerts fire during a single incident, the real issue isn't that your engineers lack documentation. It's that your architecture allows 200 different things to break independently.

The Question Most Teams Miss

Organizations usually ask: How can we manage alerts better?

The better question is: Why are there so many alerts in the first place?

Alert fatigue gets treated as an ops problem — adjust PagerDuty, refine notification rules, write more runbooks. But incidents keep generating hundreds of alerts. That's because alerts aren't the problem. They're just the symptom.

The actual problem is in your system design.

What Actually Happens

Take a customer-facing app on Kubernetes. One database latency spike.

Within minutes:

Application pods timeout
CPU climbs as retries pile up
Message queues back up
API response times tank
Load balancer health checks fail
Autoscaling spins up new pods
Those pods can't pass readiness checks
Cache hit rates drop
Downstream services start failing

One failure. Two hundred alerts:

40 infrastructure alerts
60 application alerts
30 database alerts
20 queue alerts
50 synthetic monitoring alerts

Did 200 systems actually fail? No. One thing broke. Your architecture just exposed it 200 different ways.

Why Better Documentation Won't Help

Runbooks let people respond faster. They don't reduce the number of failure signals. If an incident throws 300 alerts at you, a great runbook just helps you navigate the noise more efficiently. It doesn't eliminate the noise.

It's like putting better labels on a car's dashboard warning lights while ignoring the fact that a single engine problem triggers 30 different indicators. The labels help. The engine still needs fixing.

What Actually Matters

Teams with mature reliability practices focus on one thing: reducing how far failures propagate.

Isolation works. A failing service shouldn't take down everything else. Use circuit breakers, bulkheads, service boundaries, graceful degradation. Make failures stay in their lane.

Alert hierarchies matter. Not every metric should alert. If the database goes down, you alert on that. If the API gets slow because the database is down, that's a derivative symptom — group it with the root cause alert, don't fire it separately. Give people one actionable alert, not dozens of related noise.

Root cause visibility works. Your observability setup should answer "what actually broke?" not "here are 150 warnings, good luck." Connect the dots so correlations are obvious.

Failure blast radius matters. Architecture designed to contain failures generates far fewer alerts than architecture that lets one broken thing cascade everywhere.
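
The alert hierarchy idea can be sketched as a filter over a dependency map: an alert is a symptom if something it depends on is also alerting. The component names below are made up, and real alert managers express this with grouping and inhibition rules rather than application code:

```python
def root_cause_alerts(firing, depends_on):
    """Keep only alerts whose own dependencies are not also firing.

    firing: set of component names with an active alert.
    depends_on: dict mapping a component to the components it calls.
    """
    def has_firing_dependency(component, seen=None):
        if seen is None:
            seen = set()
        for dep in depends_on.get(component, []):
            if dep in seen:
                continue  # guard against dependency cycles
            seen.add(dep)
            if dep in firing or has_firing_dependency(dep, seen):
                return True
        return False

    return {c for c in firing if not has_firing_dependency(c)}


depends_on = {
    "api": ["database"],
    "checkout": ["api"],
    "queue-worker": ["database"],
}
firing = {"database", "api", "checkout", "queue-worker"}

print(root_cause_alerts(firing, depends_on))
```

Four components are alerting, but only one alert is worth paging on. The rest can be attached to it as context.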

What to Actually Measure

Most teams track MTTR, availability, error rates, SLA compliance. Those matter. But they miss the architectural signal:

Alert-to-incident ratio. How many alerts per incident? 1-10 is healthy. 10-50 is a problem. 50+ means your architecture is amplifying failure signals.

Root cause multiplication factor. One broken component shouldn't create 100 alerts. If it does, that number tells you something about your coupling.

Alert actionability. What percentage of your alerts actually need human action? If only 5%, the other 95% is noise.
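
These numbers are cheap to compute from an incident export. A small sketch; the function names are hypothetical, and the boundaries at exactly 10 and 50 are an assumption, since the bands above overlap at those values:

```python
def alert_to_incident_ratio(alert_count: int, incident_count: int) -> float:
    """Average number of alerts fired per incident."""
    return alert_count / incident_count


def classify_ratio(ratio: float) -> str:
    """Map a ratio onto the bands described above."""
    if ratio <= 10:
        return "healthy"
    if ratio <= 50:
        return "problem"
    return "amplifying"


def actionability(actionable_alerts: int, total_alerts: int) -> float:
    """Share of alerts that needed a human to do something."""
    return actionable_alerts / total_alerts


ratio = alert_to_incident_ratio(alert_count=200, incident_count=1)
print(ratio, classify_ratio(ratio))
print(f"{actionability(actionable_alerts=10, total_alerts=200):.0%} actionable")
```

Track the ratio per incident over time. A falling ratio suggests failures are being contained, not just muted.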

The Real Issue

Executives think alert fatigue is a staffing problem. Managers think it's a process problem. Engineers blame monitoring.

Most of the time it's actually a systems design problem. Every unnecessary dependency, every tightly coupled service, every retry storm, every cascading failure mechanism adds another alert that will fire during the next incident. The monitoring system isn't broken. It's just revealing how tightly woven everything is.

Worth Asking

When your team is drowning in alerts, the instinct is to improve runbooks and escalation policies. Resist that. Ask something harder:

Why does a single failure become hundreds of signals?

Because each alert is telling you something. And sometimes what it's really telling you isn't about how to respond faster. It's about how the system is built.

Please see this and share your insights on this

Mrinal Narang — Fri, 12 Jun 2026 15:22:19 +0000

MongoDB DR Drill Automation with Terraform, Python & Jenkins — How We Made Restores Boring

Mrinal Narang — Fri, 12 Jun 2026 15:21:39 +0000

Backups Don't Save You. Restores Do.

We ran a MongoDB restore drill last quarter. It failed — not the restore itself, but the confidence. Nobody in the room was sure the data was actually intact. The service came back up, and we all just stared at each other.

That was the problem. So we fixed it by automating everything.

One Jenkins job now provisions infra, builds the replica set, restores from dumps, validates data integrity, and stores a full audit trail. Here's exactly how it works.

The Goal

Remove every manual, error-prone step from the DR process:

Identical restore flow across all environments
Automated replica set setup — no manual rs.initiate() typos
Real validation that proves data is intact, not just assumed
Full audit trail for post-mortems and compliance reviews

The Pipeline: 5 Stages

1. Infrastructure with Terraform

Every drill starts with clean infra. Terraform provisions EC2s, networking, and persistent volumes from scratch — same starting point every time. No leftover state. No "works on my machine" surprises.

resource "aws_instance" "mongo_node" {
  count         = 3
  ami           = var.mongo_ami
  instance_type = "t3.medium"
  tags = {
    Name = "mongo-dr-node-${count.index}"
    Role = "mongodb-replica"
  }
}

2. Replica Set Creation (Python)

Instead of manually running rs.initiate() and rs.add() and hoping the timing works, a Python script handles the entire setup — ordering, retries, and confirmation.

from pymongo import MongoClient
import time

def init_replica_set(primary_host, secondary_hosts):
    # Connect straight to the first node; there is no replica set to discover yet
    client = MongoClient(f"mongodb://{primary_host}:27017", directConnection=True)
    config = {
        "_id": "rs0",
        "members": [{"_id": i, "host": h}
                    for i, h in enumerate([primary_host] + secondary_hosts)]
    }
    client.admin.command("replSetInitiate", config)
    # Wait up to ~60s (30 polls, 2s apart) for a PRIMARY to be elected
    for _ in range(30):
        status = client.admin.command("replSetGetStatus")
        if any(m["stateStr"] == "PRIMARY" for m in status["members"]):
            return True
        time.sleep(2)
    raise TimeoutError("Replica set did not elect a PRIMARY in time")

Automating this removes timing issues and misconfiguration. Every replica set comes up the same way.

3. Backup & Restore

Backups are normalized into compressed archives. The restore unpacks a dump and applies it to the fresh nodes:

# Create a gzip-compressed dump of the source database
mongodump --host "$SOURCE_HOST" --db "$DB_NAME" \
  --out /backup/dump --gzip

# Restore into the DR environment, dropping existing collections first
mongorestore --host "$DR_HOST" --db "$DB_NAME" \
  "/backup/dump/$DB_NAME" --gzip --drop

4. Validation & Comparison — The Part Most Teams Skip

This is the step that actually builds confidence. The validation script:

Checks which collections exist (flags missing collections)
Compares document counts collection by collection
Compares indexes between source and restored DB
Samples _id values for obvious data mismatches

from pymongo import MongoClient


def validate_restore(source_uri, dr_uri, db_name):
    src = MongoClient(source_uri)[db_name]
    dr  = MongoClient(dr_uri)[db_name]

    report = {"status": "pass", "collections": {}}

    for col in src.list_collection_names():
        src_count = src[col].count_documents({})
        dr_count  = dr[col].count_documents({})
        # Compare index names only; key specs and options are not checked here
        src_idx   = sorted(src[col].index_information().keys())
        dr_idx    = sorted(dr[col].index_information().keys())

        # Track the two checks separately so the report shows which one failed
        count_match = src_count == dr_count
        index_match = src_idx == dr_idx
        report["collections"][col] = {
            "count_match":  count_match,
            "source_count": src_count,
            "dr_count":     dr_count,
            "index_match":  index_match
        }
        if not (count_match and index_match):
            report["status"] = "fail"

    return report
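
The function above covers counts and indexes. The other two checks in the list (missing collections and sampled `_id` values) could be sketched as below. This assumes PyMongo-style collection objects with `aggregate` and `find`; `missing_collections` and `missing_sampled_ids` are hypothetical names, not the original script's:

```python
def missing_collections(src_names, dr_names):
    """Collections present in the source but absent from the restored DB."""
    return sorted(set(src_names) - set(dr_names))


def missing_sampled_ids(src_col, dr_col, sample_size=100):
    """Randomly sample _id values from the source; return those not found in DR."""
    # $sample picks random documents; $project keeps only the _id field
    pipeline = [{"$sample": {"size": sample_size}}, {"$project": {"_id": 1}}]
    sampled = [doc["_id"] for doc in src_col.aggregate(pipeline)]
    if not sampled:
        return []
    # One round trip: fetch whichever sampled ids exist on the DR side
    found = [doc["_id"] for doc in dr_col.find({"_id": {"$in": sampled}}, {"_id": 1})]
    # List membership (not a set) so unhashable _id values still work
    return [doc_id for doc_id in sampled if doc_id not in found]
```

A non-empty result from either helper would be a reason to set the report status to "fail".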

Exit code 0 = counts and indexes match → Jenkins passes.
Non-zero = mismatch → Jenkins fails the build immediately.
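
`validate_restore` only returns a dict, so something has to turn it into a file and an exit code. A minimal sketch of that wrapper, assuming the report shape above; `write_report` and the `reports/validation.json` path are illustrative choices, not the original script's:

```python
import json
from pathlib import Path


def write_report(report, path="reports/validation.json"):
    """Persist the report as JSON and return the process exit code."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    # default=str keeps non-JSON values (for example ObjectId) from breaking the dump
    out.write_text(json.dumps(report, indent=2, default=str))
    return 0 if report.get("status") == "pass" else 1


# In the script's entry point (source_uri, dr_uri, db_name come from job inputs):
#   sys.exit(write_report(validate_restore(source_uri, dr_uri, db_name)))
```

Writing the file before computing the exit code means a failed drill still leaves a report behind for Jenkins to archive.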

No more guessing. No more staring at each other in the war room.

5. Jenkins Orchestration

Single Jenkins pipeline. Stages run sequentially, each one gated on the previous:

pipeline {
  agent any
  stages {
    stage('Provision Infra') {
      steps {
        sh 'terraform init && terraform apply -auto-approve'
      }
    }
    stage('Setup Replica Set') {
      steps {
        sh 'python3 scripts/init_replica_set.py'
      }
    }
    stage('Restore MongoDB') {
      steps {
        sh 'bash scripts/restore.sh'
      }
    }
    stage('Validate Restore') {
      steps {
        sh 'python3 scripts/validate_restore.py'
      }
    }
  }
  post {
    always {
      // Archive in post/always so a failed validation still leaves its report behind
      archiveArtifacts artifacts: 'reports/*.json, logs/*.log', allowEmptyArchive: true
    }
  }
}

Every run is logged, every report is archived. When auditors ask if restores work — you show them a report with timestamps, counts, and index diffs. Not a gut feeling.

Lessons Learned

Automate infra, not just the restore. Terraform gives you a clean slate every drill. Manual infra setup introduces variability that hides real problems.

Validation is not optional. A restore that "seems fine" is not the same as a restore that is fine. Document count mismatches and missing indexes are easy to catch automatically and nearly impossible to catch by eyeballing logs.

Logs equal trust. The audit trail is what makes your DR process credible to others — engineers, management, auditors. Without it, you're asking people to take your word for it.

Minimal input reduces errors. We trimmed required inputs to just host + DB name and let scripts infer the rest. Less to type = fewer mistakes under pressure.

Practice makes permanent. Each drill found a small improvement. After ten drills, the process was genuinely fast and boring — which is exactly what you want.

The Outcome

We went from a 3-hour manual war room exercise to a single Jenkins job anyone can trigger. The drills are now predictable, repeatable, and quick.

More importantly — everyone on the team believes the restores work, because the validation script proves it every single time.

Boring DR is good DR.

Running MongoDB in production? When did you last drill a full restore? Drop your setup in the comments — curious how teams handle validation.

How Kong's Control Plane / Data Plane Split Cut Our Gateway Costs by 34% (And Made It a Security Layer)

Mrinal Narang — Fri, 12 Jun 2026 15:13:19 +0000

The Problem With How Most Teams Run Kong

If you set up Kong the default way, everything lives together — routing, policy enforcement, plugin execution, live traffic handling. One deployment doing all the things.

It works. Until it doesn't.

When traffic spikes, you scale up. But you're scaling the control plane too, which barely does anything at runtime. You're paying compute for config management that gets touched only when something changes — not on every request.

That was us. Scaling more than we needed to, paying for it, and not realizing why.

Splitting Control Plane from Data Plane

The data plane is hot. It handles every live request, every millisecond, 24/7. It needs to be fast, lean, and close to your services.

The control plane is cold. It pushes config — route definitions, plugin settings, policy changes. It fires when something changes, then sits quiet.

When you separate them:

Data plane scales with your actual traffic
Control plane runs small and cheap, sized for config ops not request volume
You stop paying for compute you're not using

That architectural change alone dropped our gateway infra cost by 34%. No feature removal. No degraded performance. Just stop running one thing at the scale of another.

Then We Added Plugins — And Kong Became Something Else

This is where it gets interesting. Once the infra is clean, you can actually think about what Kong should be doing for your stack.

JWT Validation at the Gateway

Every request carries a token. Kong verifies it before the request gets anywhere near a service. No valid token, request dies at the edge.

Your services stop writing token-verification logic entirely. No more 12 slightly different JWT implementations across 12 services. One place, one standard, enforced consistently.

plugins:
- name: jwt
  config:
    secret_is_base64: false
    claims_to_verify:
      - exp
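
To make concrete what the gateway takes off your services, here is a minimal sketch of an HS256 signature check plus the `exp` check, using only the Python standard library. It illustrates the idea and is not Kong's implementation; `sign_hs256` and `verify_hs256` are hypothetical names, and the signer exists only to produce test input:

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    # JWTs drop base64 padding; restore it before decoding
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims, secret):
    """Build a signed token (used here only to generate example input)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def verify_hs256(token, secret, now=None):
    """Return the claims if the signature and exp are valid, else None."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
    except ValueError:  # wrong segment count, bad base64, or bad JSON
        return None
    if not isinstance(header, dict) or not isinstance(claims, dict):
        return None
    if header.get("alg") != "HS256":  # never trust other algorithms implicitly
        return None
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(),
                        hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):  # constant-time compare
        return None
    exp = claims.get("exp")
    now = time.time() if now is None else now
    if not isinstance(exp, (int, float)) or exp <= now:
        return None  # missing or expired exp claim
    return claims
```

Every service that verifies tokens itself carries some version of this code. Doing it once at the edge is what removes the duplication.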

OAuth 2.0 for Third-Party Integrations

Handled at the gateway, not scattered across services. External partners authenticate once at the edge. Your internal services never see unauthenticated traffic.

Rate Limiting Per Consumer, Not Just Per Route

This is the one most teams miss. Route-level rate limiting is blunt. Consumer-level rate limiting is precise.

plugins:
- name: rate-limiting
  consumer: free-tier
  config:
    minute: 100
- name: rate-limiting
  consumer: enterprise
  config:
    minute: 10000

Same plugin, same gateway, different policy per consumer. Kong resolves the consumer from the credential on the request (here, the JWT), so the free tier gets 100 req/min and enterprise gets 10,000. Zero application code involved.
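
The per-consumer idea can be illustrated with a small in-memory fixed-window counter. This is a sketch of the concept only, not how Kong stores its counters; `PerConsumerLimiter` is a hypothetical name, and rejecting unknown consumers is an assumption of this example:

```python
class PerConsumerLimiter:
    """Fixed-window rate limiter keyed by consumer name."""

    def __init__(self, limits_per_minute):
        self.limits = dict(limits_per_minute)
        # (consumer, minute_window) -> request count; old windows are never
        # evicted here, which a real implementation would need to handle
        self.counters = {}

    def allow(self, consumer, now):
        """Return True if `consumer` may make a request at time `now` (seconds)."""
        limit = self.limits.get(consumer)
        if limit is None:
            return False  # assumption: consumers without a policy are rejected
        key = (consumer, int(now // 60))
        count = self.counters.get(key, 0)
        if count >= limit:
            return False
        self.counters[key] = count + 1
        return True


limiter = PerConsumerLimiter({"free-tier": 100, "enterprise": 10000})
```

Because the counter key includes the consumer, one tenant exhausting its quota has no effect on another tenant hitting the same route.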

Request Transformation

Strip headers you don't want passing through to services. Inject headers your services expect. Normalize payloads from external partners sending data in formats your team didn't design for — all before the request touches your backend.
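
As a rough illustration of the strip-and-inject step, here is a minimal sketch in Python. It is not Kong's request-transformer plugin; `transform_headers` is a hypothetical helper, and overwriting client-supplied values on inject is a design choice of this sketch (it stops clients from spoofing headers your services trust):

```python
def transform_headers(headers, strip=(), inject=None):
    """Return a new header dict with `strip` removed and `inject` applied.

    Header names are matched case-insensitively, as HTTP requires.
    """
    inject = inject or {}
    # Drop stripped headers and any client copy of a header we are about to set
    blocked = {name.lower() for name in strip} | {name.lower() for name in inject}
    out = {k: v for k, v in headers.items() if k.lower() not in blocked}
    out.update(inject)
    return out


incoming = {"Authorization": "Bearer abc", "X-Debug": "1", "x-request-source": "spoofed"}
cleaned = transform_headers(incoming, strip=["x-debug"],
                            inject={"X-Request-Source": "gateway"})
```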

IP Whitelisting on Internal Routes

Certain paths accessible only from known sources. One config block. Applies across the entire stack.
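
In Kong this is handled by the `ip-restriction` plugin. The underlying check is a CIDR membership test, sketched here with Python's standard `ipaddress` module; `ip_allowed` is a hypothetical name and the ranges in the usage note are placeholders:

```python
import ipaddress


def ip_allowed(client_ip, allowed_cidrs):
    """Return True if client_ip falls inside any of the allowed CIDR ranges."""
    try:
        addr = ipaddress.ip_address(client_ip)
    except ValueError:
        return False  # unparseable address: deny by default
    # Membership is False when address and network IP versions differ
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)
```

For example, `ip_allowed("10.1.2.3", ["10.0.0.0/8"])` passes while an address outside that range is rejected before it reaches an internal route.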

What This Actually Changed

Before: auth logic lived in every service. Every team implemented it differently. Every security audit found inconsistencies. Every new service started from scratch building things that had already been built six times.

After: the gateway owns identity, rate policy, request shape, and access control. Services own business logic. That boundary is clean and it stays clean.

When we did a security audit post-migration, the findings dropped significantly. Not because we wrote better application code — we hadn't touched it. Because we moved the security surface to one place and made it consistent.

The Architecture That Came Out of This

External Traffic
      │
      ▼
[Kong Data Plane]  ◄──── [Kong Control Plane] (small, separate, cheap)
      │                         │
  JWT auth                   Config push
  Rate limiting               Plugin management
  Request transform           Route definitions
  IP whitelist
      │
      ▼
[Your Services]  ←── Business logic only

The data plane is the only thing in the hot path. The control plane is a config server. Your services are finally just services.

TL;DR

Split control plane from data plane → stop scaling what doesn't need to scale
JWT at the gateway → services never verify tokens themselves again
Per-consumer rate limiting → fine-grained control without application changes
Request transformation → normalize at the edge, not inside your code
One security surface → consistent, auditable, maintainable

If you're running Kong as a glorified reverse proxy, you're leaving most of its value on the table.