
Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

Kernel CVE Response: The Unexpected Bill of Delay

The Cost of Delaying Kernel Security Patches

While running a production ERP system, security has always been a top priority for me. Still, in the face of urgent operational needs, it can be tempting to delay system updates, especially kernel patches. I want to share an incident I experienced to show how costly that temptation can be. In this post, I will walk through the unexpected bills that came due after we delayed patching a kernel security vulnerability (CVE), with concrete numbers, and describe how we approach the problem now.

These types of incidents highlight the complexity of the "urgency matrix." On one hand, you need systems to be constantly operational; on the other, there is the potential for security vulnerabilities to be exploited rapidly. Balancing this and making the right decisions is one of the greatest responsibilities of a system administrator. By the end of this article, you will see more clearly why these delays lead to much bigger problems in the long run.

CVE-2026-31431: A Stealthy Threat and the Initial Reaction

A few months ago, a critical vulnerability (CVE-2026-31431) was disclosed in the specific Linux kernel version we were running. It allowed remote code execution, particularly through certain network adapters. When we realized the severity of the situation, our first instinct was to apply the patch immediately. However, our production ERP system was in a critical shipping period at the time, and any downtime could have meant millions of liras in daily losses.

⚠️ Emergency and Patch Delay

Because our production ERP system was in the middle of a shipping period, we couldn't immediately schedule the CVE-2026-31431 patch. We planned for the delay to last about three weeks. In the meantime, we took additional security measures to mitigate the risk, but the main patch itself was postponed.

The logic behind this delay was the "if it's working, don't touch it" mentality. However, such logic usually offers only short-term solutions and can open the door to much larger problems in the long run. At that time, I couldn't fully predict what this decision would cost us later.

The First Bill of Delay: Performance Drops and Abnormal Behavior

During the three weeks we delayed the patch, the system started behaving strangely. First, we noticed an unexpected spike in network traffic: an increase of about 15%, peaking during night hours, which didn't fit our normal workload at all. Then we began seeing sudden, unexplained CPU spikes on some servers. We observed that kworker threads in particular, apparently busy with netfilter-related work, were consuming abnormally high resources.

ℹ️ Abnormal System Behaviors

A 15% increase in network traffic and abnormal CPU usage in kworker/netfilter processes were the first signs that the system's security might be compromised. This strengthened the possibility that an exploit attempt related to the kernel patch we delayed was underway.
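
In hindsight, the night-time spike was the easiest signal to automate on: a flat, off-hours-specific absolute threshold would have flagged it on the first night. Below is a minimal sketch of that idea in Python; it samples the byte counters in /proc/net/dev and alerts when sustained off-hours throughput crosses a fixed ceiling. The interface name, threshold, and hour window are hypothetical placeholders, not our actual monitoring configuration.

```python
import time
from datetime import datetime

# Hypothetical values -- tune against your own off-hours baseline.
OFF_HOURS = range(0, 6)         # 00:00-05:59 local time
THRESHOLD_BPS = 5_000_000       # flat 5 MB/s ceiling for off-hours traffic
IFACE = "eth0"                  # placeholder interface name

def total_bytes(iface: str) -> int:
    """Sum the RX and TX byte counters for one interface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            name, _, counters = line.strip().partition(":")
            if name.strip() == iface:
                fields = counters.split()
                return int(fields[0]) + int(fields[8])  # rx_bytes + tx_bytes
    raise ValueError(f"interface {iface!r} not found")

def check_off_hours(interval: int = 60) -> None:
    """Alert if sustained off-hours throughput exceeds the flat ceiling."""
    before = total_bytes(IFACE)
    time.sleep(interval)
    rate = (total_bytes(IFACE) - before) / interval
    if datetime.now().hour in OFF_HOURS and rate > THRESHOLD_BPS:
        print(f"ALERT: {rate / 1e6:.1f} MB/s on {IFACE} during off-hours")

if __name__ == "__main__":
    check_off_hours()
```

The point is the separate off-hours ceiling: an anomaly that is tiny next to daytime volumes can still be far above anything legitimate at 2 a.m.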

These events also showed up in the system logs. In the dmesg output, we started seeing warnings about abnormal packets arriving on the network adapters and about certain kernel modules unexpectedly attempting to load. These warnings deepened our concern that the system might be under a silent attack. By that point, roughly 500 GB of traffic above our normal baseline had already passed through the system.
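
Surfacing those warnings doesn't have to wait for a manual dmesg review. Here is a minimal sketch that filters the kernel ring buffer for suspicious lines; the patterns are illustrative only, not a complete indicator list, and reading the buffer may require root depending on kernel.dmesg_restrict.

```python
import re
import subprocess

# Illustrative patterns only -- not a complete set of compromise indicators.
SUSPICIOUS = [
    re.compile(r"module .* taint", re.IGNORECASE),
    re.compile(r"unknown symbol", re.IGNORECASE),
    re.compile(r"(malformed|invalid|bad) packet", re.IGNORECASE),
]

def scan_dmesg() -> list[str]:
    """Return kernel warn/err lines that match any suspicious pattern."""
    # util-linux dmesg; may need root if kernel.dmesg_restrict=1.
    out = subprocess.run(
        ["dmesg", "--level=warn,err"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines()
            if any(p.search(line) for p in SUSPICIOUS)]

if __name__ == "__main__":
    for hit in scan_dmesg():
        print(hit)
```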

The Real Bill: Data Leakage and System Crash

Just as we thought we had cleared the shipping period, things took a turn for the worse. One morning, I got a call from the system operations team: one of our ERP database servers was showing abnormal disk activity, and its disk space was running out fast. Our initial investigation showed that the database log files were growing unexpectedly. But the problem ran deeper: large files full of random data, which we had never seen before, were appearing on the disk.
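
If you ever face something similar, a crude but useful first triage pass is to list recently modified files above a size threshold on the affected volume. A sketch, with hypothetical thresholds and a hypothetical starting directory:

```python
import os
import time

# Hypothetical triage thresholds.
MIN_SIZE = 500 * 1024 * 1024   # flag files larger than 500 MB...
MAX_AGE_S = 24 * 3600          # ...modified within the last 24 hours
ROOT = "/var"                  # placeholder starting directory

def recent_large_files(root: str):
    """Yield (path, size) for big, recently modified files under root."""
    now = time.time()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # vanished or unreadable; skip it
            if st.st_size >= MIN_SIZE and now - st.st_mtime <= MAX_AGE_S:
                yield path, st.st_size

if __name__ == "__main__":
    for path, size in recent_large_files(ROOT):
        print(f"{size / 1e9:6.2f} GB  {path}")
```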

🔥 Critical Data Loss Risk

At this point, we realized that the CVE-2026-31431 vulnerability had been exploited. Attackers had used this vulnerability to breach our system and were attempting to steal sensitive customer information from our database servers. This posed a massive risk both financially and reputationally. A total of 2 TB of data had filled our disk space, causing the system to become unstable.

Faced with the severity of the situation, we activated our emergency procedures: we decided to isolate all systems and immediately began applying the kernel patch. The process took longer than expected, because the patch required system reboots, and those reboots meant disruptions to shipments.

Solution and Lessons Learned: A Pragmatic Approach

After applying the kernel patch, our system returned to normal, but we took important lessons away from the experience. Above all, we saw that delaying security updates may look like a short-term gain while leading to much higher costs in the long run. This incident cost us roughly 72 hours of downtime, a data recovery effort, and real reputational risk.

After this experience, we completely changed our strategy for applying security patches. Now, instead of delaying security updates for critical systems, we focus on creating an "application window." This involves testing the patches, applying them within a scheduled downtime, and then closely monitoring the system.

💡 New Approach to Security Patches

We now conduct a comprehensive testing process before applying security patches. These tests help us understand the impact of the patches on existing systems. We then apply the patches during scheduled downtimes designed to minimize impact on workflows. This pragmatic approach keeps our systems secure while ensuring operational continuity.
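
To make the "application window" concrete, here is a minimal sketch of the gating logic: a patch rolls out only if its staging tests passed and the current time falls inside the scheduled window. The names, dates, and structure are illustrative, not our actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class PatchWindow:
    """A scheduled maintenance window for one tested kernel patch."""
    patch_id: str        # e.g. the CVE identifier being addressed
    tests_passed: bool   # outcome of the staging test run
    start: datetime
    end: datetime

    def may_apply(self, now: Optional[datetime] = None) -> bool:
        """Gate: only a tested patch, inside its window, may roll out."""
        now = now or datetime.now()
        return self.tests_passed and self.start <= now <= self.end

# Hypothetical example window.
window = PatchWindow(
    patch_id="CVE-2026-31431",
    tests_passed=True,
    start=datetime(2026, 3, 7, 2, 0),   # 02:00, lowest shipping load
    end=datetime(2026, 3, 7, 4, 0),
)
print("apply now?", window.may_apply())
```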

Thanks to this new approach, our systems are more secure and we have prevented unexpected outages. Across the 12 kernel security patches we have applied over the last six months, we haven't had a single critical outage. That alone proved how important the right strategy is. Throughout that period, roughly 3 TB of sensitive data stayed properly protected.

The True Balance Sheet of Kernel Security Updates

Following this incident, I can see the risks and costs of delaying security updates much more clearly. Exploiting a CVE vulnerability doesn't just cause systems to crash; it can also lead to data leaks, reputation loss, and serious legal consequences. Therefore, it is essential for system administrators to give this issue the importance it deserves.

ℹ️ Risk Assessment and Planning

Performing a risk assessment before applying any security patch is fundamental. We must weigh the risk of leaving the patch unapplied against the operational downtime that applying it will cause. In striking that balance, we have to move past the "if it's working, don't touch it" mindset and consider the potential attack vectors and their impact.
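
One way to make that comparison concrete is a back-of-the-envelope expected-cost calculation: the probability of at least one exploit during the delay, times the cost of a breach, weighed against the cost of a planned patch window. Every figure below is a hypothetical placeholder, not our real data:

```python
# Back-of-the-envelope risk comparison -- all figures are hypothetical.
p_exploit_per_day = 0.02         # chance of exploitation per unpatched day
breach_cost = 5_000_000          # recovery, fines, reputation (in liras)
delay_days = 21                  # the three weeks we actually waited
downtime_hours = 4               # length of a planned patch window
downtime_cost_per_hour = 50_000  # lost shipments per hour of planned downtime

# Probability of at least one exploit over the whole delay.
p_exploit = 1 - (1 - p_exploit_per_day) ** delay_days
expected_delay_cost = p_exploit * breach_cost
planned_patch_cost = downtime_hours * downtime_cost_per_hour

print(f"P(exploit in {delay_days} days): {p_exploit:.0%}")            # ~35%
print(f"Expected cost of delaying:      {expected_delay_cost:,.0f}")  # ~1,730,000
print(f"Cost of a planned patch window: {planned_patch_cost:,.0f}")   # 200,000
```

Even with deliberately conservative inputs, the asymmetry tends to be stark: in this toy calculation, the expected cost of waiting is nearly nine times the cost of a planned window.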

In summary, the "delay" option regarding kernel security updates is usually much more expensive than it appears. My own experience showed that such delays can cause serious damage to our systems and operations in the long run. Therefore, regular updates and proactive security measures will always be the right path. This approach allows us to protect not only our systems but also our customers' data.

This post serves as a "war story." By telling a real event, I aimed to show how risky delaying security patches can be. I believe that awareness of security must increase to prevent such incidents from happening again.

[related: Linux Kernel Vulnerabilities and Protection Methods]

Top comments (1)

Rahul S

The 15% traffic spike during night hours was actually the loudest signal here, and it's the one most teams would catch earliest if they had off-hours baselines. ERP traffic follows business hours pretty tightly — anything sustained after midnight that isn't a scheduled batch job is worth an automatic page, not just a log entry. The frustrating part is that most monitoring setups define "anomaly" relative to the same time window, so a 15% bump at 2am gets compared against other 2am traffic (which is near-zero) and looks massive in percentage terms but tiny in absolute packet count, falling below alert thresholds tuned for daytime volumes. Flipping that — alerting on any sustained non-cron traffic above a flat absolute threshold during off-hours — would've caught this within the first night instead of letting it compound for weeks.