That Moment of Panic During a Client Meeting
It was around 10:00 AM last Tuesday. As I was presenting live to a client, a notification sound came from my system. A quick glance at the screen revealed that a service I was monitoring had stopped responding. The cold sweat I broke into at that moment will probably stay with me as one of the most unforgettable moments of my career. We were in the middle of a critical demo for a major project we had been working on for months, and just as we reached the most important part, a service on my own VPS had crashed.
As someone with many years in the field, these kinds of situations aren't new to me. However, this time was different; I was at the center of it, and I had to try and salvage the situation without abandoning the presentation. Instead of turning to the client and saying, "One second, we're experiencing a system issue," I needed to quickly focus on resolving the problem without hitting the panic button. This post details the crisis I experienced that day, the approach I took, and what I learned to handle such situations.
Details of the Crisis: What Happened?
It took me a few seconds to fully grasp what was happening during the meeting. The notification from my system indicated that nginx was no longer routing traffic to my backend application. The client's eyes were on the screen, waiting for me to continue. While I kept talking, my eyes scanned the terminal for the output of systemctl status myapp. And there it was: "Active: failed (result=oom-killer)".
🔥 OOM Killer Activated!
The OOM Killer (Out Of Memory Killer) is a mechanism in the Linux kernel that terminates specific processes when the system runs out of memory. This prevents the system from crashing entirely but causes the terminated application to stop. This is exactly what happened in my case.
As you can imagine, this output was not good news. My application had consumed so much memory that the kernel had automatically shut it down. The meeting with the client had essentially turned into an incident response scenario. At that moment, instead of thinking, "I wish this server had more RAM," I had to focus on the question, "How do I fix this right now?"
Situation Assessment and Initial Interventions
Trying to minimize disruption to the meeting as much as possible, I quickly ran a few commands. First, I checked the dmesg output. This command shows kernel messages and provides detailed information about when and why the OOM Killer was triggered. In the output, I could clearly see the PID of my application and how much memory it was consuming. It was using approximately 3.5 GB of RAM, which was quite high for my VPS under normal circumstances.
Next, I checked real-time memory usage with the top command. What I saw was that the myapp process was using significantly more memory than I expected; it shouldn't normally consume anywhere near that much. Was there a memory leak, or had the latest update caused the issue? While still talking to the client, I was simultaneously digging deeper with htop.
ℹ️ Why Htop?
htop is an interactive process viewer with a more user-friendly interface than the top command. I prefer it in such situations because it allows me to see memory and CPU usage more clearly.
Restarting the Halted Application
My first priority was to get the application back online. I restarted the application with the systemctl restart myapp command. This command restarts the process that was terminated by the OOM Killer. A few seconds later, when I checked with systemctl status myapp, I saw that the application was active (running). This allowed me to save that critical moment of the meeting. As I told the client, "Yes, I think we've resolved the issue," I let out a deep breath internally.
However, this was a temporary solution. I needed to find the root cause of the problem and implement a permanent fix. Otherwise, I could face a major issue in the next meeting, or worse, in production. The biggest advantage of running on my own VPS was that I could investigate such issues whenever I wanted. As soon as the meeting with the client ended, I began a deep dive analysis.
Root Cause Analysis: Memory Leak or Something Else?
As soon as the meeting concluded, I SSH'd into my VPS. My first task was to examine the logs related to the OOM Killer activation in more detail. Using journalctl -xe to view system logs, I noticed that my application was consuming excessive memory, particularly during a specific operation. This operation was an API endpoint that processed incoming data.
💡 The Power of Journalctl
journalctl is a powerful tool for querying logs on systemd-based Linux systems. When used with the -xe flags, it lets me see the details of the error along with other relevant system messages, which speeds up the debugging process.
Since my application is Node.js-based, I also reviewed Node.js's memory management and garbage collection mechanisms. I started using tools like process.memoryUsage() to monitor the V8 JavaScript engine's memory usage. These tools show the application's heap usage and total memory consumption in real-time.
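As an illustration, a minimal sketch of that kind of check could look like the snippet below. The 30-second interval and the log format are arbitrary choices for this example, not how my service is actually instrumented.

```javascript
// Sketch: periodically log the numbers reported by process.memoryUsage().
// The 30-second interval and the log format are example choices, nothing more.
const toMB = (bytes) => (bytes / 1024 / 1024).toFixed(1);

setInterval(() => {
  const { rss, heapTotal, heapUsed, external } = process.memoryUsage();
  console.log(
    `[mem] rss=${toMB(rss)}MB heapUsed=${toMB(heapUsed)}/${toMB(heapTotal)}MB external=${toMB(external)}MB`
  );
}, 30_000).unref(); // unref() so this timer alone doesn't keep the process alive
```

Watching heapUsed climb across samples while a particular endpoint is being hit is usually the first concrete sign that something is being retained.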
Lessons from My Astro Build OOM Experience
This incident reminded me of a situation I encountered about a month ago with my Astro project. Astro's build process could also consume high amounts of memory at times. Back then, I encountered a process that used 90% of the system's memory during the build and was eventually OOM-killed. In that experience, I saw that Astro's astro build command could consume excessive memory, especially with certain plugins or large projects.
The problem this time wasn't related to the build process but to a runtime issue. However, in both cases, the fundamental cause was the same: insufficient memory or a memory leak. Since I manage over 13 Docker containers on my own VPS, I need to carefully monitor the memory usage of each service. OOM situations like these can affect not only the service but the entire system.
Detecting the Memory Leak
Detecting a memory leak in Node.js is often difficult. However, I observed that the memory usage continuously increased when a specific function of the application was triggered and never decreased. This implies that a reference to an object was being held and could not be released by the garbage collector. Examining the codebase, I found that in a module processing incoming data, temporary objects created while processing large datasets were not being properly cleaned up.
Specifically, new data was constantly being pushed into an array whose size was never bounded, and the array stayed in memory even after each operation was complete. Memory usage therefore grew over time until it eventually triggered the OOM Killer. It was the kind of mistake I'd file under "it happens" rather than dress up in corporate-consultant language.
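To confirm that kind of retention, one option is to compare heap snapshots taken before and after hammering the suspect code path. The sketch below is a simplified, hypothetical version of such a check: processData stands in for the real handler, ./data-processor is a made-up module path, and the snapshot file names are arbitrary.

```javascript
// Sketch: take heap snapshots before and after calling the suspect function many
// times, then diff them in Chrome DevTools (Memory tab, "Comparison" view).
// processData and './data-processor' are placeholders for the real code.
const v8 = require('v8');
const { processData } = require('./data-processor');

v8.writeHeapSnapshot('before.heapsnapshot');

for (let i = 0; i < 10_000; i++) {
  processData({ id: i, payload: 'x'.repeat(1024) }); // synthetic input
}

global.gc?.(); // only defined when run with --expose-gc; forces a GC before the second snapshot

v8.writeHeapSnapshot('after.heapsnapshot');
// Objects created by processData that survive the forced GC and pile up in the
// second snapshot point to a reference being retained somewhere.
```

Loading the two files into DevTools and sorting the comparison view by retained size makes an ever-growing structure like this stand out quickly.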
The Solution: Fixing the Memory Leak and Implementing Safeguards
Once I identified the root cause of the problem, implementing the solution was relatively straightforward. I found the code block causing the memory leak and ensured that the relevant objects were properly released from memory by assigning them to null or letting them go out of scope after processing the incoming data.
```javascript
// Old and problematic code example (simplified)
let largeDataArray = [];

function processData(newData) {
  largeDataArray.push(newData); // Data is continuously added, and the array grows
  // ... data processing ...
  // No cleanup or limitation of largeDataArray
}
```

```javascript
// Corrected code example
let largeDataArray = [];
const MAX_ARRAY_SIZE = 1000; // Maximum size determined

function processData(newData) {
  largeDataArray.push(newData);
  if (largeDataArray.length > MAX_ARRAY_SIZE) {
    largeDataArray.shift(); // Array is limited by removing the oldest element
  }
  // ... data processing ...
}
```
```javascript
// Alternatively, a long-lived reference can be set to null after processing
let lastProcessedItem = null; // kept at module scope, would otherwise stay retained

function processDataAndClean(newData) {
  lastProcessedItem = { ...newData, processed: true };
  // ... data processing ...
  lastProcessedItem = null; // reference released after processing so the GC can reclaim it
}
```
After making this change, I restarted the application and continued to monitor memory usage with htop. This time, the memory usage remained stable, and the OOM Killer was not triggered. However, this was just the first step. I needed to take additional measures to prevent such incidents from happening again.
Future Safeguards
- More RAM: I considered increasing the RAM capacity of my current VPS. However, this was an option that would also increase costs. Using existing resources more efficiently has always been my first choice.
- Monitoring and Alerting: I set up a more detailed monitoring system using tools like Prometheus and Grafana, configured to send automatic alerts when memory usage exceeds a certain threshold. This way, I can detect the problem before the OOM Killer is triggered (a minimal in-process sketch of the same idea follows this list).
- Resource Limits (Docker): Wherever possible, I set a specific memory limit for each container so that no single container can take over the entire system. For example, docker run --memory=512m ....
- Automatic Restart Mechanisms: I made the Restart=on-failure setting in systemd more aggressive, so the application restarts automatically if it crashes.
- More Frequent Code Reviews: I established a process for more frequent reviews of critical sections, especially those related to memory management.
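As referenced in the monitoring bullet above, here is a minimal sketch of an in-process guard that complements the external monitoring. It is only an illustration of the idea: RSS_LIMIT_MB, CHECK_INTERVAL_MS, and sendAlert are placeholders you would replace with your own threshold and notification channel.

```javascript
// Sketch of an in-process memory guard: warn before the kernel's OOM Killer has to act.
// The threshold, interval, and sendAlert below are placeholders, not my real configuration.
const RSS_LIMIT_MB = 400;         // example threshold, tune to your VPS
const CHECK_INTERVAL_MS = 15_000;

function sendAlert(message) {
  // Placeholder: wire this up to your own channel (Slack webhook, e-mail, etc.).
  console.error(`[ALERT] ${message}`);
}

let alerted = false; // simple dedup so the same breach doesn't alert on every interval

setInterval(() => {
  const rssMB = process.memoryUsage().rss / 1024 / 1024;
  if (rssMB > RSS_LIMIT_MB && !alerted) {
    alerted = true;
    sendAlert(`RSS at ${rssMB.toFixed(0)}MB, above the ${RSS_LIMIT_MB}MB limit`);
  } else if (rssMB <= RSS_LIMIT_MB) {
    alerted = false; // reset once usage drops back below the limit
  }
}, CHECK_INTERVAL_MS).unref();
```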
This crisis once again reminded me that managing my own infrastructure is both a great responsibility and a continuous learning process. That moment of panic during the client meeting pushed me to be more careful and better prepared.
In my next post, I will explain in more detail the "preflight resource guard + auto-fix + dedup-alert" pattern I use in such incident response processes.