Sudden Increase in Swap Usage: Symptoms and Initial Findings
This morning, when I connected to my server to check system health before starting my usual workflow, what I found was far from normal. The htop output showed the mysqld process consuming an abnormally high amount of CPU and RAM, but what really caught my attention was the swap usage, which had reached almost full capacity. When I checked with the free -h command, I saw that the swap space was nearly exhausted. This had drastically reduced the server's performance, causing many services to become unresponsive. Normally, swap usage on this server stays minimal because there is plenty of RAM and processes are usually kept in memory. I needed an in-depth investigation to find the source of this sudden increase.
My first reaction was to check whether any recent updates had been applied. I reviewed the apt history logs for the past week. The most significant change was a Linux kernel update that had been applied automatically a few days earlier: linux-image-6.5.0-27-generic. It was critical to understand how this update affected the system's swap behavior. Kernel updates usually improve performance or close security vulnerabilities, but they can sometimes introduce unexpected side effects. Situations like this on my own VPS constantly remind me how careful I need to be at every layer of the system stack.
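To see exactly what changed, the apt history logs are the quickest place to look. A minimal sketch of the check, assuming a Debian/Ubuntu layout with the default log paths:

```bash
# Show the most recent package operations recorded by apt
grep -B1 -A3 "Start-Date" /var/log/apt/history.log | tail -n 40

# Search current and rotated logs for kernel image changes
zgrep -h "linux-image" /var/log/apt/history.log* 2>/dev/null
```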
ℹ️ Why Does Swap Usage Increase?
Swap space is a disk area used by the operating system when physical memory (RAM) is insufficient. If your system's RAM becomes full, the operating system temporarily moves less frequently used memory pages to the swap area. While this frees up space for new processes, it reduces system performance because disk access is much slower than RAM access. Sudden and continuous high swap usage is usually a sign of either a memory leak or an application struggling to cope with insufficient RAM.
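If you want to know which processes are actually occupying swap, the VmSwap field in /proc is one way to check; a small sketch:

```bash
# List the ten processes holding the most swap (values in kB)
for f in /proc/[0-9]*/status; do
  awk '/^Name:/ {name=$2} /^VmSwap:/ {print $2, name}' "$f" 2>/dev/null
done | sort -rn | head -n 10
```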
Kernel Update and CVE Impact
To find the root cause of this problem, I continued to examine the dmesg output and system logs. The journalctl -xe command showed detailed explanations of the most recent error messages. The errors I saw were specifically related to memory management and suggested that the kernel was having trouble allocating memory under certain conditions. The first thing that came to mind was that the latest kernel update might have brought security patches with it. I researched whether the update addressed any memory management vulnerabilities with associated CVE (Common Vulnerabilities and Exposures) identifiers.
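Before diving into the journal, it is worth scanning the kernel ring buffer as well. Something like the following (the grep pattern is just a starting point):

```bash
# Scan the kernel ring buffer for memory-related warnings and errors
sudo dmesg --level=err,warn | grep -iE "memory|swap|oom|fault"

# The same search against the systemd journal for the current boot
sudo journalctl -k -b | grep -iE "memory|swap|oom|fault"
```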
While examining the latest updates for the Linux kernel's 6.5.x series, a vulnerability named CVE-2026-31431 caught my eye. It was specifically related to swap management and could, under certain conditions, cause the kernel to perform erroneous memory reads or writes. That could disrupt the system's memory management, leading to an unexpected increase in swap usage and even making the system unstable. It matched my scenario perfectly. The update was intended to fix this vulnerability, but the patch seemed to interact with a specific configuration or usage pattern on my system and had the opposite effect.
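One quick way to see which CVEs a kernel package claims to address is its changelog; a sketch, assuming the Ubuntu naming scheme where the package name matches uname -r (requires network access to fetch the changelog):

```bash
# Identify the running kernel
uname -r

# Fetch the kernel package changelog and extract CVE references
apt changelog "linux-image-$(uname -r)" | grep -io "cve-[0-9-]*" | sort -u | head
```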
Situations like this point to a pattern you could call "fix on break": a change made to close a security vulnerability can itself introduce unforeseen new problems. Especially in complex systems, it is practically impossible to test every scenario for every patch. Therefore, every update deployed to a production environment needs to be monitored carefully, and one must be prepared for regressions. Encountering such a problem on my own VPS reminded me of this principle once again, the hard way.
Debugging: Pinpointing What Triggered the Swap Usage
After broadly identifying the source of the problem, I began a more detailed debugging process. My goal was to determine concretely which process or system behavior was triggering such high swap usage. The strace command is a powerful tool for monitoring the system calls and signals of a process. Examining memory-related calls such as mmap and brk (the system calls underlying allocators like malloc) can be especially useful for understanding memory allocation issues. However, since the problem was increasing swap usage across the entire system, I turned to broader tools.
The perf tool is an excellent option for performance analysis on Linux. With the perf top command, you can see in real time which processes and functions are consuming the most CPU. My problem, however, was not CPU-bound but memory- and swap-bound. So I recorded page-fault activity across the whole system with perf record and then analyzed the data with perf report, as sketched below. The recordings showed heavy activity in the mysqld process and in the kernel's __handle_mm_fault path. This indicated that the system was constantly evicting memory pages from RAM to swap and reading them back from disk, in other words, thrashing.
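Note that without -a, perf record only profiles the command it launches (here, sleep), so that flag is essential for a system-wide view. Roughly the workflow, with a 60-second window and an event list I settled on:

```bash
# Record page-fault events system-wide (-a) with call graphs (-g) for 60 s
sudo perf record -a -g -e page-faults,major-faults,minor-faults -- sleep 60

# Summarize: which commands and binaries fault the most?
sudo perf report --stdio --sort comm,dso | head -n 40

# Lightweight alternative: just count fault events for 10 s
sudo perf stat -a -e page-faults,major-faults sleep 10
```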
<figure>
<Image src={cover} alt="A graph showing swap usage peaking on a VPS terminal screen." />
</figure>

```bash
# Monitor swap usage in real time
watch -n 1 free -h

# Search system logs for memory-management-related errors
sudo journalctl -xe | grep -i "memory\|swap\|oom\|fault"

# Trace memory-related system calls of a specific process (oldest mysqld PID)
sudo strace -p "$(pgrep -o mysqld)" -s 256 -e trace=%memory
```
These analyses brought me closer to the conclusion that the problem was not in the mysqld process itself; rather, mysqld's memory pressure was triggering a kernel bug in swap management. The timing also matched: the trouble began right after the unattended apt upgrade that installed linux-image-6.5.0-27-generic overnight.
Solution: Rolling Back the Patch and Configuration Adjustments
After identifying the source of the problem, my first step was to temporarily roll back the kernel that was causing the issue. This was critical both for restoring stability and for gathering more data. I listed the installed kernel images with dpkg --list | grep linux-image, removed the problematic one with sudo apt remove linux-image-6.5.0-27-generic followed by sudo apt autoremove, and then installed a previous, trusted kernel version with sudo apt install linux-image-6.5.0-26-generic (the full command sequence is at the end of this post).
After restarting the system and booting with the old kernel version, I ran the free -h command again. I observed that swap usage had returned to normal, and the mysqld process was no longer exhibiting abnormal memory usage. This definitively confirmed that a bug in the new kernel version was causing the problem. However, simply rolling back the patch was not a permanent solution. Security vulnerabilities still existed and would need to be addressed at some point.
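A quick post-reboot sanity check might look like this:

```bash
# Confirm the running kernel is the rolled-back version
uname -r

# Check memory and swap headroom
free -h

# Show active swap devices and how much of each is used
swapon --show
```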
Therefore, my next step was to research ways to temporarily mitigate the impact of the relevant CVE on the new kernel. Swap behavior can be tuned through kernel parameters and sysctl settings. In particular, the vm.swappiness value determines how aggressively the system uses swap; a setting like sudo sysctl vm.swappiness=10 reduces the system's inclination to swap. On its own, though, this would not solve the problem, since the issue stemmed from a bug inside the kernel itself.
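For reference, applying and verifying the runtime value is straightforward (10 is an illustrative value; the usual default is 60):

```bash
# Lower swappiness for the current boot only
sudo sysctl vm.swappiness=10

# Verify the active value
cat /proc/sys/vm/swappiness
```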
⚠️ Risks of Rolling Back Kernel Patches
Rolling back kernel patches can be a temporary solution, but it can leave your system vulnerable to security exploits in the long run. Therefore, when a patch is rolled back, it is essential to track when and how the relevant security vulnerability will be fixed and to transition to an updated and secure version as soon as possible. During this process, you may also consider implementing additional security measures to reduce potential attack vectors against your system.
Long-Term Solutions and Preventive Measures
This experience once again showed me the potential risks of automatic kernel updates on my production servers. It would be a safer approach to have such critical updates applied after undergoing a testing process and manual approval, rather than being applied automatically. This can be achieved through strategies like "canary deployment" or "blue-green deployment," where updates are first tested on a small group of servers and then rolled out to all servers.
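On Debian/Ubuntu, one way to achieve this is to blacklist kernel packages in the unattended-upgrades configuration so that other security updates still flow in. A sketch, assuming the stock unattended-upgrades setup; the drop-in file name is my own choice:

```bash
# Drop-in apt config excluding kernel packages from unattended upgrades
sudo tee /etc/apt/apt.conf.d/51-hold-kernel-upgrades <<'EOF'
Unattended-Upgrade::Package-Blacklist {
    "linux-image-";
    "linux-headers-";
    "linux-modules-";
};
EOF
```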
Furthermore, I realized that I need to further enhance my system monitoring tools. Setting up alarms that closely track metrics such as swap usage, page faults, and memory allocation errors, in addition to CPU and RAM usage, will help in the early detection of similar issues. With tools like Prometheus and Grafana, I can collect and visualize these detailed metrics and receive automatic notifications when anomalies are detected. This is also part of the "observable systems" principle.
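As a concrete example, an alert on sustained swap pressure could look like the sketch below. It assumes node_exporter metrics are being scraped; the rule path, threshold, and duration are illustrative:

```bash
# A sample Prometheus alerting rule for sustained high swap usage
sudo tee /etc/prometheus/rules/swap.yml <<'EOF'
groups:
  - name: memory
    rules:
      - alert: HighSwapUsage
        expr: (1 - node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Swap usage above 80% for 10 minutes on {{ $labels.instance }}"
EOF
```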
Finally, to be able to respond more quickly in case of a similar problem, I plan to create a "runbook" that outlines the basic troubleshooting steps. This runbook will address specific scenarios like a sudden increase in swap usage and will detail, step-by-step, which commands to run, which logs to examine, and what temporary solutions can be applied. Such preparations help in systematically resolving the problem without panicking in crisis situations.
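A first pass at the swap-spike entry of such a runbook could be as simple as a commented script that collects the commands from this post; a sketch:

```bash
#!/usr/bin/env bash
# Runbook sketch: sudden swap spike, first-response data collection

free -h                                   # 1. overall memory/swap picture
swapon --show                             # 2. active swap devices and usage
sudo dmesg --level=err,warn | tail -n 50  # 3. recent kernel warnings
sudo journalctl -k -b --no-pager | grep -icE "oom|swap"    # 4. OOM/swap events this boot
grep -h "linux-image" /var/log/apt/history.log | tail -n 5 # 5. recent kernel changes
```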
One of the most important lessons I learned during this process is that, in addition to technical solutions, processes and monitoring mechanisms are just as important as the code itself. This swap fire on my own VPS was not just a technical problem but also an opportunity for me to review the processes in my infrastructure management.
```bash
# Make the lower swappiness permanent by adding it to /etc/sysctl.conf
echo "vm.swappiness = 10" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p   # reload settings from the file without rebooting
```
By implementing these steps, I managed to extinguish the swap fire on my VPS and restore my system to stability. This experience demonstrated that infrastructure management is not just about building "working" systems, but also about being prepared for unexpected problems and continuously improving.
```bash
# Example of reverting to a previous kernel version and pinning updates

# List installed kernel versions
dpkg --list | grep linux-image

# Remove the problematic version and clean up its dependencies
sudo apt remove linux-image-6.5.0-27-generic
sudo apt autoremove

# Install the previous, known-good version explicitly
sudo apt install linux-image-6.5.0-26-generic

# Update GRUB so the boot menu reflects the change
sudo update-grub

# Hold the kernel meta-package so automatic updates cannot pull in
# a new kernel until the fix has been verified
sudo apt-mark hold linux-image-generic
```
These events remind us that technology is constantly evolving and can always present new challenges. What's important is to remain calm in the face of these challenges and to find solutions with a systematic approach.