
Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

VPS Swap Fire: A Nightmare Started by a Kernel CVE Patch

Last week, on a Monday morning of all mornings, the "Critical Alert" notifications on my screen filled me with dread. The systems on my VPS, especially my Docker containers, had suddenly slowed to a crawl. Even SSH sessions were lagging, and my commands took ages to complete. Yet I hadn't done anything unusual; I had only applied the usual overnight updates.

This sudden slowdown was a serious problem, because this VPS is my entire world. It runs over 13 Docker containers: a PostgreSQL database, a Redis cache, my Next.js applications, and of course the Astro site that hosts this blog. Everything had been running smoothly together, until that morning. So I started a deep dive to find the source of the problem.

Swap Usage Spiraling Out of Control

The first place I looked was the server's overall resource usage. The moment I ran htop, I couldn't believe my eyes: swap usage was nearing 100%. Normally I keep my swap space very low; sometimes I disable it entirely. This time was different. Swap usage that high meant the system's RAM was exhausted and it had spilled over onto the swap space on disk, which is exactly why performance had plummeted.
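htop gives the overview, but the same numbers can be read straight from /proc/meminfo. A minimal sketch, assuming a standard Linux box and POSIX sh:

```shell
#!/bin/sh
# Read swap totals (in KiB) from /proc/meminfo and report usage.
swap_total=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
swap_free=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)

if [ "${swap_total:-0}" -gt 0 ]; then
    swap_used=$(( swap_total - swap_free ))
    swap_pct=$(( swap_used * 100 / swap_total ))
    echo "swap: ${swap_used} KiB of ${swap_total} KiB in use (${swap_pct}%)"
else
    echo "swap: disabled"
fi
```

The same information is available from `free -h` and `swapon --show`; parsing /proc/meminfo is just handy inside scripts and alerts.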

Why had swap usage suddenly spiked so high? I immediately checked the dmesg and journalctl logs and saw a stream of warnings related to kcompactd and the oom-killer. kcompactd was pinning the CPU at around 90%, a clear signal that the kernel was struggling with memory management.
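To repeat that triage quickly, the kernel logs can be grepped for the usual memory-pressure suspects. A small sketch, assuming dmesg and journalctl are available; either may fail without root, which the sketch tolerates rather than aborting:

```shell
#!/bin/sh
# Count memory-pressure messages from the kernel ring buffer and the journal.
# Either source may be unavailable (permissions, no systemd), so errors are
# silenced and an empty input simply counts as zero hits.
kernel_hits=$( { dmesg 2>/dev/null; journalctl -k -b --no-pager 2>/dev/null; } \
    | grep -Eic 'kcompactd|oom-killer|out of memory' )
echo "memory-pressure log lines found: ${kernel_hits:-0}"
```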

⚠️ The Dangers of Swap Usage

Swap space is a disk-based storage area that comes into play when physical RAM (memory) is insufficient. However, disks are much slower than RAM. Increased swap usage directly leads to a noticeable drop in system performance. Excessive swap usage can cause the server to freeze or processes to be abruptly terminated by the oom-killer (Out-Of-Memory killer).

A Nightmare Started by a Kernel CVE Patch

As I examined the logs in more detail, I realized the problem had started with the kernel update I applied overnight. I had specifically applied a patch related to CVE-2026-31431. This CVE was intended to close a security vulnerability in the kernel's network stack. However, it seemed this patch had caused unexpected side effects on my system.

This CVE patch was closely tied to the kernel's memory management. It contained a fix specifically for the algif_aead module, which is used in VPN and encryption workloads. I wasn't making VPN connections directly, but Docker's network operations and some of my firewall rules may have exercised this module indirectly. There's no polished "corporate consultant" spin here; this was entirely my own experience, one of those "these things happen" moments.

Identifying the Source of the Problem

The reason behind kcompactd consuming so much CPU was the kernel's attempt to keep memory pages contiguous. However, this process was causing a bottleneck in memory management. Everything had started with the kernel update I applied overnight. In my case, this update was incompatible with my existing setup.

At this point, I remembered the times my Astro builds consumed a lot of RAM; back then the system would also dip into swap. But this time the problem ran deeper, at the kernel level. kcompactd reaching 92% CPU usage is not normal, and it had left the server barely able to accept SSH connections.

```
# A snippet from dmesg logs (illustrative, not the actual error messages)
[Mon May 09 06:15:32 2026] kcompactd0: highmem-intensive workload detected, entering compact mode
[Mon May 09 06:16:01 2026] Out of memory: Kill process 12345 (kworker/u8:1) score 1000 or sacrifice child
[Mon May 09 06:16:05 2026] systemd invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
```

Solutions and Trade-offs

The first thing that came to my mind to solve the problem was to revert the kernel update I had applied. However, this meant reintroducing a version with a security vulnerability back into the system. This was not an acceptable option. Instead, I needed to adopt a more secure approach.

Another option was to adjust kcompactd's behavior. By changing kernel parameters, I could make the memory compaction process less aggressive. However, this would not be a long-term solution and could lead to other problems.

Ultimately, I decided the most sensible path was to find an alternative fix for the vulnerability that the problematic patch addressed. It would take more time, but it was the safe option.

ℹ️ Adjusting Kernel Parameters (Be Careful!)

While it's possible to adjust the kernel's memory management, these operations must be done very carefully. An incorrect parameter change can lead to system instability or prevent it from booting. These settings are usually made via the /etc/sysctl.conf file or files within the /etc/sysctl.d/ directory. However, this approach would be a short-term solution in my case.
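For illustration only, such a drop-in might look like the following. The values are examples, not recommendations, and vm.compaction_proactiveness only exists on kernels 5.9 and newer:

```
# /etc/sysctl.d/90-memory-tuning.conf  (illustrative values; test before relying on them)
vm.swappiness = 10                 # prefer reclaiming page cache over swapping
vm.compaction_proactiveness = 0    # kernels >= 5.9: tone down proactive compaction
```

Changes are applied with `sudo sysctl --system`, and can be reverted by deleting the file and applying again.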

Temporary Solution: Reducing Swap Usage

Until I got to the root of the problem, I had to implement some temporary solutions to keep the system running. First, I cleaned up unnecessary Docker images and build caches. I tried to free up disk space with the command docker system prune -a. Then, I focused on optimizing the build process of my Astro project.

During this time, I also recalled the runner state corruption issue I experienced with GitHub Actions. In that case, deleting directories under /home/runner/_work/_temp had resolved it. Such issues indicate imbalances in the current system.

As a temporary solution, I considered writing a script that would automatically stop or lower the priority of certain operations when swap usage was very high. However, this was not a complete fix, just a preventive measure.
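A sketch of what that watchdog might look like. The 80% threshold and the renice policy are made up for illustration, renicing usually needs root, and the ps sort flags assume GNU procps:

```shell
#!/bin/sh
# Hypothetical swap watchdog: when swap usage crosses a threshold,
# demote the heaviest memory consumer instead of killing it.
THRESHOLD_PCT=80   # illustrative threshold

# Pure helper so the policy is easy to test: prints percent of swap in use.
swap_used_pct() {
    total=$1; free=$2
    [ "$total" -gt 0 ] || { echo 0; return; }
    echo $(( (total - free) * 100 / total ))
}

check_and_renice() {
    t=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
    f=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
    pct=$(swap_used_pct "${t:-0}" "${f:-0}")
    if [ "$pct" -ge "$THRESHOLD_PCT" ]; then
        # Demote the top memory consumer; renice may need root, so tolerate failure.
        top_pid=$(ps -eo pid=,%mem= --sort=-%mem | awk 'NR==1 {print $1}')
        renice +10 -p "$top_pid" 2>/dev/null || true
        echo "swap at ${pct}% - reniced PID ${top_pid}"
    fi
}
```

Run from cron or a systemd timer every minute or so; as the post says, this is a preventive measure, not a fix.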

The Real Solution: CVE Patch Alternative

Instead of reverting the official patch for CVE-2026-31431, I decided to look for an alternative that mitigated the same vulnerability without depending on that specific patch. This took some research: I needed a more stable encryption path than the patched algif_aead module, one compatible with my system.

Finally, I found a kernel version that mitigated the impact of this specific CVE and ran stably on my system. I installed the new kernel version and restarted the system. My first check confirmed that swap usage had returned to normal levels, and kcompactd was no longer straining the CPU. SSH connections had sped up.
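After a kernel swap like this, my habit is a quick sanity check: confirm which kernel actually booted and where swap stands. A minimal sketch, assuming free is available:

```shell
#!/bin/sh
# Post-reboot sanity check: report the running kernel and current swap usage.
echo "running kernel: $(uname -r)"
swap_line=$(free -k 2>/dev/null | awk '/^Swap:/ {print "swap used:", $3, "KiB of", $2, "KiB"}')
echo "${swap_line:-swap usage not reported}"
```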

During this period, I also remembered the disk full issue I experienced on my own VPS on April 28th. At that time, there were 33 GB of build cache and 23 GB of unused images on the disk. While this current issue was more related to memory management, I once again saw how important regular cleaning and optimization are for the overall health of the system.

💡 Pipeline Reliability Pattern: Preflight, Auto-fix, Dedup-Alert

When I encounter unexpected issues like this, I try to apply a general "pipeline reliability" pattern. This pattern is as follows:

  1. Preflight Resource Guard: Before starting an operation, check if resources (disk, RAM, CPU) are sufficient.
  2. Auto-fix: Automatically apply issues that can be resolved automatically (disk cleanup, simple service restart).
  3. Dedup-Alert: Prevent repeated alerts for the same issue; try to fix the problem first, and notify only if it cannot be resolved.

The AI-assisted content creation process for this blog was designed around the same pattern.
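The preflight step above can be sketched as a small guard script. The thresholds and the checked mount point are arbitrary examples, not recommendations:

```shell
#!/bin/sh
# Preflight resource guard: refuse to start an expensive job when disk or
# memory headroom is too low. Thresholds and the / mount are illustrative.
MIN_DISK_MB=2048
MIN_AVAIL_RAM_MB=512

# Pure helper so the policy is easy to test: succeeds when both limits are met.
preflight_ok() {
    disk_mb=$1; ram_mb=$2
    [ "$disk_mb" -ge "$MIN_DISK_MB" ] && [ "$ram_mb" -ge "$MIN_AVAIL_RAM_MB" ]
}

disk_mb=$(df -Pm / | awk 'NR==2 {print $4}')
ram_mb=$(awk '/^MemAvailable:/ {print int($2/1024)}' /proc/meminfo)

if preflight_ok "${disk_mb:-0}" "${ram_mb:-0}"; then
    echo "preflight OK (disk ${disk_mb} MB, RAM ${ram_mb} MB available)"
else
    echo "preflight FAILED - skipping job" >&2
fi
```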

Lessons Learned and Future Steps

This experience taught me several important lessons. Firstly, I realized I need to be much more careful when applying kernel updates. Every update can lead to unexpected side effects on the system. Especially in production environments, testing updates in a staging environment is essential.

Secondly, regular monitoring of system resources (RAM, swap) and early detection of anomalies are crucial. In addition to tools like htop, dmesg, and journalctl, using more advanced monitoring systems can be beneficial. When managing so many containers on my own server, even a single container issue can affect the entire system.

Finally, while it's important to apply CVE patches quickly, I must not forget that these patches can themselves cause problems. When applying security fixes, I need to watch the system's overall behavior closely. Perhaps my next blog post can be a guide to the cost advantages of self-hosted GitHub Actions runners and to using a VPS to stay within quotas.

Have you ever encountered such unexpected system issues? I'd love to hear about it in the comments.
