〽️ 𝙍𝙤𝙨𝙝𝙖𝙣

Posted on May 17

Debugging Production Alerts Without Chasing The Wrong Problem

#devjournal #monitoring #performance #sre

Today I worked through a production alert that looked simple at first:

“Services are unhealthy. Memory issue?”

But once we started digging, it turned out to be a mix of real issues, noisy errors, and one confusing timezone mismatch.

The First Assumption

The alert mentioned memory problems, worker restarts, database warnings, and a third-party integration error.

At first, it was tempting to group everything together and say:

“The system is running out of memory.”

But that would have been too broad.

So instead, we split the investigation by service.

What Was Actually Happening

One web service really was hitting memory pressure. Its process manager was killing workers after the total memory usage crossed the configured limit.

That part was real.

But another background worker was reported as “crash-looping.” When we checked the logs, it did not show the usual signs of a memory crash.

There were no lines like:

Process too large
Out of memory
Children dying rapidly
Exited with status 255

Instead, the logs showed graceful shutdowns and normal restarts:

Shutting down
Scheduler exiting
Bye
Booted application
Starting memory monitoring

That changed the conclusion. The worker may have restarted, but the logs did not prove it was crashing because of memory.
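A check like this can be scripted. The sketch below is only an illustration: the signature strings are the ones quoted above, while the `classify_restart` function and its three labels are names I made up for this example, not part of any tool we used.

```python
import re

# Signature strings quoted in this post. Real log formats vary,
# so treat these lists as a starting point, not a complete set.
MEMORY_CRASH_SIGNATURES = [
    "Process too large",
    "Out of memory",
    "Children dying rapidly",
    "Exited with status 255",
]
GRACEFUL_SIGNATURES = [
    "Shutting down",
    "Scheduler exiting",
    "Bye",
]


def _compile(signatures):
    # Word boundaries keep short signatures such as "Bye" from
    # matching inside longer words.
    return [re.compile(r"\b" + re.escape(s) + r"\b", re.IGNORECASE) for s in signatures]


_MEMORY_PATTERNS = _compile(MEMORY_CRASH_SIGNATURES)
_GRACEFUL_PATTERNS = _compile(GRACEFUL_SIGNATURES)


def classify_restart(log_lines):
    """Label a batch of log lines (hypothetical helper).

    Returns "memory_crash" if any crash signature appears,
    "graceful_restart" if only shutdown signatures appear,
    and "unknown" otherwise.
    """
    text = "\n".join(log_lines)
    if any(p.search(text) for p in _MEMORY_PATTERNS):
        return "memory_crash"
    if any(p.search(text) for p in _GRACEFUL_PATTERNS):
        return "graceful_restart"
    return "unknown"


worker_logs = [
    "Shutting down",
    "Scheduler exiting",
    "Bye",
    "Booted application",
    "Starting memory monitoring",
]
print(classify_restart(worker_logs))  # graceful_restart
```

Crash signatures are checked first on purpose: a batch that contains both kinds of lines should still be treated as a possible memory crash.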

The Timezone Trap

Another confusing part was time.

One tool showed local time. Another exported logs in UTC.

So a window that looked like this in the UI:

20:30 - 21:05

was actually this in the logs:

16:30 - 17:05 UTC

That four-hour offset is enough to derail an investigation.

If you query the wrong time window, you can easily miss the real event or accidentally blame the wrong one.
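Converting the window explicitly removes the guesswork. This is a minimal sketch that assumes the UI displays a fixed UTC+4 offset, which is what the example above implies. The year in the sample call is a placeholder, and `ui_window_to_utc` is a hypothetical helper name.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the UI shows a fixed UTC+4 offset, inferred from
# the 20:30 -> 16:30 example in this post.
UI_TZ = timezone(timedelta(hours=4))


def ui_window_to_utc(start_local, end_local, tz=UI_TZ):
    """Attach the UI offset to naive datetimes and convert them to UTC."""
    start = start_local.replace(tzinfo=tz).astimezone(timezone.utc)
    end = end_local.replace(tzinfo=tz).astimezone(timezone.utc)
    return start, end


# Placeholder year; only the clock times come from the post.
start_utc, end_utc = ui_window_to_utc(
    datetime(2024, 5, 17, 20, 30),
    datetime(2024, 5, 17, 21, 5),
)
print(start_utc.strftime("%H:%M"), "-", end_utc.strftime("%H:%M"), "UTC")
# 16:30 - 17:05 UTC
```

For a region with daylight saving time, a named zone from the standard library's `zoneinfo` module would be a safer choice than a fixed offset.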

Other Noise

There were also database warnings. They were real, but they were related to a small set of records and not directly tied to the memory alert.

There was also a recurring third-party SMS error. It looked alarming in the alert, but it predated the incident and was not part of it.
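One way to tell recurring noise from a new failure is to check whether the same error was already showing up before the incident started. The sketch below assumes you can pull the timestamps of one error message as timezone-aware datetimes. The function name, the seven-day lookback, and the threshold of three prior occurrences are all arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_recurring_noise(event_times, incident_start,
                       lookback=timedelta(days=7), min_prior=3):
    """Return True if the error already occurred at least `min_prior`
    times in the `lookback` period before the incident started.

    All datetimes are assumed to be timezone-aware.
    """
    window_start = incident_start - lookback
    prior = [t for t in event_times if window_start <= t < incident_start]
    return len(prior) >= min_prior


# Placeholder data: an error seen once a day for three days before the incident.
incident_start = datetime(2024, 5, 17, 16, 30, tzinfo=timezone.utc)
sms_errors = [incident_start - timedelta(days=d) for d in (1, 2, 3)]
print(is_recurring_noise(sms_errors, incident_start))  # True
```

A result of `True` does not make the error harmless. It only says the error is unlikely to be what changed when the alert fired.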

The Final Picture

After separating the signals, the situation looked more like this:

Web memory pressure: real
Background worker memory crash-loop: not proven
Database warnings: real, but separate
Third-party SMS error: noisy, unrelated
Timezone mismatch: caused confusion

Takeaway

Production debugging is not just about finding errors.

It is about separating related signals from unrelated ones.

Before acting on an alert, it helps to ask:

Which service actually produced this log?
Is this the correct time window?
Is the timestamp local time or UTC?
Do we see the real failure signature?
Is this new, or just recurring noise?
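The first two questions can be turned into a filter that runs before anyone reads a single log line. This is a sketch under assumptions: `LogRecord` and `in_scope` are hypothetical names, the service names are made up, and records are assumed to carry timezone-aware UTC timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class LogRecord:
    service: str
    timestamp: datetime  # assumed timezone-aware, in UTC
    message: str


def in_scope(records, service, start_utc, end_utc):
    """Keep only records from one service inside an explicit UTC window."""
    for bound in (start_utc, end_utc):
        if bound.tzinfo is None:
            # A naive bound is exactly the local-vs-UTC trap described earlier.
            raise ValueError("window bounds must be timezone-aware")
    return [
        r for r in records
        if r.service == service and start_utc <= r.timestamp <= end_utc
    ]


# Placeholder records and service names for illustration.
records = [
    LogRecord("web", datetime(2024, 5, 17, 16, 45, tzinfo=timezone.utc), "Out of memory"),
    LogRecord("worker", datetime(2024, 5, 17, 16, 50, tzinfo=timezone.utc), "Shutting down"),
    LogRecord("web", datetime(2024, 5, 17, 20, 45, tzinfo=timezone.utc), "Booted application"),
]
window = (
    datetime(2024, 5, 17, 16, 30, tzinfo=timezone.utc),
    datetime(2024, 5, 17, 17, 5, tzinfo=timezone.utc),
)
scoped = in_scope(records, "web", *window)
print([r.message for r in scoped])  # ['Out of memory']
```

Rejecting naive window bounds is deliberate: it forces whoever runs the query to say which timezone they mean.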

The biggest lesson from today:

Don’t fix the loudest symptom. First prove what is actually failing.

DEV Community