DEV Community

Nema Chandra Goswami

The 18 Minutes That Tested Me

Monday. 10:15 AM. Coffee in hand.

Life was good.

Then my phone buzzed.

It was Jaya from Sales.

“The client portal is down. Customers are calling. What’s happening?”

That one sentence changes the atmosphere in seconds.

I opened the production URL.

502 Bad Gateway.

Silence.

The Situation was...

Our platform was running on Amazon EC2, served through Nginx.

It had been stable for months.

No recent risky deployments. No alerts overnight.

So why now?

I immediately looped in:

My manager, Aman
Jaya from Sales
Backend & DevOps team

Aman asked the question every leader asks:

“What’s the impact?”

Jaya didn’t sugarcoat it.

“Two new companies were onboarded recently.”

Pressure level? Maximum.

The Investigation...

✔ EC2 instance: Running
✔ CPU: Normal
✔ Memory: Stable
✔ Nginx: Active

Everything looked… fine.

But production doesn’t lie.

We checked application logs.

And there it was.

Database connection failures...

At the same time, Sales had launched a marketing campaign that morning. Traffic spiked. Our connection pool maxed out.

Success… broke the system.
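To make that failure mode concrete, here is a minimal sketch of a fixed-size pool running out under a burst of simultaneous requests. It is a toy simulation, not our production code: `BoundedPool`, `simulate`, the pool size of 5, and the timings are all assumptions chosen for illustration, and integer tokens stand in for real database connections.

```python
import queue
import threading
import time


class BoundedPool:
    """Toy pool: integer tokens stand in for real database connections."""

    def __init__(self, size):
        self._slots = queue.Queue(maxsize=size)
        for conn_id in range(size):
            self._slots.put(conn_id)

    def acquire(self, timeout):
        # Raises queue.Empty if no connection frees up within `timeout` seconds.
        return self._slots.get(timeout=timeout)

    def release(self, conn_id):
        self._slots.put(conn_id)


def simulate(pool_size, concurrent_requests, query_seconds=0.5, wait_seconds=0.05):
    """Return how many requests failed to get a connection."""
    pool = BoundedPool(pool_size)
    failures = []  # list.append is thread-safe in CPython
    start = threading.Barrier(concurrent_requests)

    def handle_request():
        start.wait()  # release all requests at once, like a traffic spike
        try:
            conn = pool.acquire(timeout=wait_seconds)
        except queue.Empty:
            failures.append(1)  # the request gives up: a "connection failure"
            return
        try:
            time.sleep(query_seconds)  # pretend to run a query
        finally:
            pool.release(conn)

    threads = [
        threading.Thread(target=handle_request)
        for _ in range(concurrent_requests)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(failures)


if __name__ == "__main__":
    print("normal morning:", simulate(pool_size=5, concurrent_requests=5), "failures")
    print("campaign spike:", simulate(pool_size=5, concurrent_requests=20), "failures")
```

With as many requests as connections, nobody waits. Once the burst exceeds the pool size, every extra request times out while the server itself still looks healthy, which fits a dashboard showing normal CPU and memory.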

The Decision

We had two options:

Restart everything and pray.

Fix it properly.

Aman asked calmly,

“What do you suggest?”

In moments like this, you don’t just answer — you own it.

I said:

Increase DB connection pool

Vertically scale the EC2 instance

Restart services in sequence

Add monitoring immediately

Plan auto-scaling after recovery
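For the monitoring step, the idea is to hear about a filling pool before customers do. The sketch below shows one way such a check could look. `PoolStats`, `pool_alert`, and the 80% and 95% thresholds are hypothetical names and values for illustration, not the monitoring we actually deployed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolStats:
    """Snapshot of a connection pool (hypothetical shape for this sketch)."""
    in_use: int
    max_size: int


def pool_alert(stats, warn_at=0.80, critical_at=0.95):
    """Classify pool utilization as 'ok', 'warning', or 'critical'.

    The thresholds are illustrative assumptions; tune them to your traffic.
    """
    if stats.max_size <= 0:
        raise ValueError("max_size must be positive")
    utilization = stats.in_use / stats.max_size
    if utilization >= critical_at:
        return "critical"
    if utilization >= warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    for in_use in (40, 85, 100):
        print(in_use, pool_alert(PoolStats(in_use=in_use, max_size=100)))
```

A check like this would run on a schedule and page someone at "warning", leaving time to scale before the pool is exhausted.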

We executed.

Clock ticking...

8 minutes. 12 minutes. 15 minutes.
At 18 minutes...

System stable. Portal loading. The new companies were back in business.

The Hidden Battle: Communication

While engineers fixed the backend, I stayed connected with Jaya.

Not panic.

Not excuses.

Just clarity.

“We are experiencing high traffic and scaling infrastructure. ETA: 15 minutes.”

That message saved trust.

Because outages hurt systems.

But silence hurts relationships.

What It Taught Me

Systems fail, traffic surprises you, and infrastructure has limits.

But leadership is tested in:

• Decision speed
• Clear communication
• Confidence under pressure

Later, Aman told me:

“You handled it well.”

And that meant more than fixing the outage.
