What Happens When You Don't Set Up Monitoring? A Bitter Lesson from

#monitoring #devops

Watching a system silently die was one of the most helpless moments of my career. Worse, it died only after days of screams that nobody heard. For me, monitoring isn't just about installing a tool; it's about establishing communication with the system and hearing its breath and its heartbeat.

In my twenty years of experience, I've repeatedly seen the most complex problems arise from the simplest oversights. More often than not, inadequate or non-existent monitoring was at the forefront of those oversights. Monitoring a system is an indispensable step toward being proactive and preventing potential disasters.

Before Symptoms Knock: Unheard Cries

While I was developing the ERP for a manufacturing company, a critical data transfer job ran every night. This job, part of the iSCSI-based supply chain integration, ran once a day and normally completed within two hours. One morning, inconsistencies in shipment planning caused massive chaos.

It took us three days to realize that the job had hung on an integration error and had been running for more than six hours. Because the systemd unit's OnFailure= setting was not configured properly, no one was notified of the error. Add to that the fact that no one was regularly checking the journald logs during that busy period, and a silent disaster had occurred. The result was incorrectly loaded trucks, vehicles returning empty, and hundreds of thousands of liras in operational damage. This was the direct cost of a lack of monitoring.
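As a minimal sketch of the kind of check that was missing here, the snippet below flags a job that has been running well past its expected duration. The two-hour figure comes from the story above; the grace factor, the function name, and the timestamps are assumptions made up for illustration, not what the real system used. On the systemd side, a directive such as RuntimeMaxSec= could serve a similar purpose by turning a hang into a failure, but that is a suggestion rather than a description of the original setup.

```python
from datetime import datetime, timedelta

# The two-hour expectation is from the story; the grace factor is an
# illustrative assumption (alert once the job exceeds 1.5x its usual runtime).
EXPECTED_RUNTIME = timedelta(hours=2)
GRACE_FACTOR = 1.5


def is_overdue(started_at, now, expected=EXPECTED_RUNTIME, grace=GRACE_FACTOR):
    """Return True when a job has been running longer than expected * grace."""
    return (now - started_at) > expected * grace


if __name__ == "__main__":
    started = datetime(2024, 1, 1, 1, 0)  # hypothetical start time
    # 1.5 hours in: still within the allowed window.
    print(is_overdue(started, datetime(2024, 1, 1, 2, 30)))
    # 6.5 hours in: well past the three-hour limit, so this should alert.
    print(is_overdue(started, datetime(2024, 1, 1, 7, 30)))
```

A check like this only helps if something runs it on a schedule and routes a True result to a human, for example through a pager or a chat channel.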

Silent Killers: Details Missed Without Monitoring

Systems often whisper to us what's happening; if we don't know how to listen, those whispers can suddenly turn into a noisy crisis. In the backend of one of my side products, I noticed Redis inexplicably starting to consume memory. Because I had chosen the wrong eviction policy (maxmemory-policy), old data wasn't being deleted, and the service was slowly bloating. If I had regularly monitored Redis's used_memory metric, an alarm would have fired long before the issue became serious; instead, I had to discover it through manual checks.
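To make the used_memory idea concrete, here is a rough sketch that parses the text returned by the Redis INFO command and warns when used_memory approaches maxmemory. The field names used_memory, maxmemory, and maxmemory_policy are real INFO fields; the function names, the 80% threshold, and the sample values are assumptions for illustration. Fetching the INFO text from a live server is left out so that the example stays self-contained.

```python
def parse_redis_info(raw):
    """Parse the 'key:value' lines of Redis INFO output into a dict of strings."""
    info = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        info[key] = value
    return info


def memory_alert(info, threshold=0.8):
    """Return a warning string if memory use looks risky, otherwise None."""
    used = int(info["used_memory"])
    limit = int(info.get("maxmemory", 0))
    if limit == 0:
        # maxmemory of 0 means Redis has no configured memory limit.
        return "maxmemory is 0 (no limit): memory can grow until the host runs out"
    ratio = used / limit
    if ratio >= threshold:
        policy = info.get("maxmemory_policy", "unknown")
        return f"used_memory at {ratio:.0%} of maxmemory (policy: {policy})"
    return None


if __name__ == "__main__":
    # Hypothetical sample of INFO output, not taken from a real server.
    sample = (
        "# Memory\r\n"
        "used_memory:900000000\r\n"
        "maxmemory:1000000000\r\n"
        "maxmemory_policy:noeviction\r\n"
    )
    print(memory_alert(parse_redis_info(sample)))
```

Including the policy in the message is a deliberate choice: a high ratio under noeviction means writes will start failing, while the same ratio under an LRU policy is often normal.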

I experienced similar situations on the PostgreSQL side. In a client project, disk space was slowly depleting due to a WAL bloat problem. Since neither WAL growth nor vacuum activity was being monitored, no one noticed the issue until it reached a critical level. On the network side, without monitoring, intermittent connection drops caused by MTU/MSS mismatches or VLAN tagging complexities were simply dismissed as "it happens sometimes." On the security side, not knowing whether fail2ban was working effectively, or whether anyone was actually reviewing the auditd logs, always bothered me.

Cost Calculation: The Price We Ignore

One of the most striking monitoring stories I experienced was at a large Turkish e-commerce site. A disk on one of the servers slowly started to fill up. Unfortunately, there was no monitoring system or alarm for critical disk usage metrics. At 03:14 AM on April 28th, the disk reached 100% capacity, and the entire payment system went down. No WAL rotation alarm was triggered because no such alarm existed.

The problem was only discovered at 09:00 AM, following numerous complaints from customers. That outage of nearly six hours led to hundreds of thousands of liras in lost revenue and irreversible customer dissatisfaction. It was bitter proof that the cost of setting up monitoring is a drop in the ocean compared to the cost of an outage.
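For a sense of how small the missing piece was, here is a minimal disk usage check built on Python's standard shutil.disk_usage. The 80% and 90% thresholds, the function names, and the path are illustrative assumptions; a real setup would more likely rely on an existing monitoring agent than on a hand-rolled script.

```python
import shutil


def usage_percent(total, used):
    """Return used space as a percentage of total space."""
    return 100.0 * used / total


def check_disk(path="/", warn=80.0, crit=90.0, usage=None):
    """Classify disk usage as OK, WARNING, or CRITICAL.

    `usage` may be a (total, used, free) tuple for testing; when omitted,
    the real values for `path` are read with shutil.disk_usage.
    """
    total, used, _free = usage if usage is not None else shutil.disk_usage(path)
    pct = usage_percent(total, used)
    if pct >= crit:
        return ("CRITICAL", pct)
    if pct >= warn:
        return ("WARNING", pct)
    return ("OK", pct)


if __name__ == "__main__":
    level, pct = check_disk("/")
    print(f"{level}: / is {pct:.1f}% full")
```

Run every few minutes and wired to a notification channel, even a check this simple would raise a warning well before a slowly filling disk reaches 100%.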

⚠️ An Important Lesson

The budget and time allocated for monitoring setup are usually far less than the cost of an impending crisis. This is not just a technical investment, but also a strategic investment for business continuity and reputation.

A Matter of Culture: Monitoring Is More Than Just a Tool

In my opinion, monitoring is not just a set of graphs and alarm panels. It reflects a team's sense of responsibility towards its systems, a proactive approach, and a desire to solve problems at their root. Imagine a company with three different ISP uplinks at its network edge, where voice quality degrades whenever DSCP marking is not done correctly. Without monitoring such critical details, finding the source of a problem is like searching for a needle in a haystack.

It's not enough to just keep systems running. We also need to know how they are running, when they are struggling, and when they need help. In my experience, the only bridge that provides this information flow is an effective monitoring infrastructure.

So, what was the most expensive lesson caused by a lack of monitoring in your career? Or, conversely, do you have a memory of a good monitoring system saving you? Share your thoughts in the comments.