When I was going through the Kubernetes 1.22 release announcement blog, there was a woot moment in it for me. Within the section titled Major Themes was a brief mention of a new alpha feature called Node system swap support. My mind rewound to 2016, when I took up a new assignment with a telecom MNC as a PaaS platform developer. My job required me to spin up 2-node K8s clusters very often (back then we did not have tools like kubeadm, kubespray etc.) using cloud VMs. I would sometimes forget to disable swap on the nodes, and the cluster wouldn’t come up. After a bit of frustration, coffee and help from colleagues, I would then realize I had once again forgotten to disable swap! Many of my fellow colleagues had the same experience, so much so that if someone faced issues bootstrapping a K8s cluster, the first question would be “Did you disable swap?”. Nostalgia apart, back then I did not dig deeper into the significance of disabling swap when standing up a K8s cluster.
So when I read the release announcement blog, I decided to dig deeper into this topic. Soon after the 1.22 release, Elana Hashman (Red Hat) published a blog which shed more light on this new alpha feature. This post is inspired by Elana’s blog, and I have used some of its contents here.
My digging led me to issue #53533, raised by Denis Gladkikh in 2017. This was a bug report that said: Kubelet/Kubernetes 1.8 does not work with Swap enabled on Linux Machines. Before I get into the details, let’s digress a bit to understand the purpose of swap memory.
Swap is space on the host disk that is used when physical RAM is full. When a Linux system runs out of RAM, inactive memory pages are moved from RAM to the swap space, so RAM can be allocated to processes that require it the most. Swap space can take the form of either a dedicated swap partition or a swap file. In most cases, when running Linux on a virtual machine, a swap partition is not present, so the only option is to create a swap file. Swappiness is a Linux kernel property that defines how often the system will use the swap space. Swappiness can have a value between 0 and 100. A low value makes the kernel try to avoid swapping whenever possible, while a higher value makes the kernel use the swap space more aggressively.
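To make this concrete, the read-only commands below inspect a Linux node’s current swap configuration using standard procfs locations (a sketch; output will vary by system, and no root access is needed):

```shell
# Total and free swap on the node, in kB.
grep -E '^(SwapTotal|SwapFree):' /proc/meminfo

# Current swappiness value (0-100).
cat /proc/sys/vm/swappiness

# Active swap files/partitions, if any (empty apart from the header when swap is off).
cat /proc/swaps
```

Changing the swappiness value (e.g. with `sysctl vm.swappiness=10`) requires root privileges.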
In prior releases, Kubernetes did not support the use of swap memory: having swap available has very strange and bad interactions with memory limits. For example, a container that hits its memory limit would then start spilling over into swap. This also means that if memory gets exhausted on the node, the node can potentially lock up completely — requiring a restart — rather than just slowing down and recovering a while later.
A workaround was available for those who really wanted swap: run the kubelet with --fail-swap-on=false. Containers that do not specify a memory requirement will then, by default, be able to use all of the machine’s memory, including swap. This might only really be a viable strategy if none of the containers ever specify an explicit memory requirement…
There are a number of possible ways that one could envision swap use on a node.
- Swap is enabled on a node’s host system, but the kubelet does not permit Kubernetes workloads to use swap.
- Swap is enabled at the node level. The kubelet can permit Kubernetes workloads scheduled on the node to use some quantity of swap, depending on the configuration.
- Swap is set on a per-workload basis. The kubelet sets swap limits for each individual workload.
The alpha feature is limited in scope to the first two scenarios. The third scenario is not implemented yet; perhaps it will be considered in a future release.
Improved memory management algorithms built on cgroups v2, such as oomd, strongly recommend the use of swap. Hence, having a small amount of swap available on nodes could improve resource pressure handling and recovery.
Applications such as the Java and Node runtimes rely on swap for optimal performance. Initialization logic of applications can be safely swapped out without affecting long-running application resource usage.
There are numerous cases in which cost of additional memory is prohibitive, or elastic scaling is impossible (e.g. on-premise/bare metal deployments). Occasional cron job with high memory usage and lack of swap support means cloud nodes must always be allocated for maximum possible memory utilization, leading to over-provisioning/high costs.
Local development or single-node clusters and systems with fast storage may benefit from using available swap (e.g. NVMe swap partitions, one-node clusters). Linux has optimizations for swap on SSD, allowing for performance boosts.
Another example is edge devices with limited memory: edge compute systems/devices with small memory footprints (<2Gi), or clusters with nodes under 4Gi of memory.
Swap can be enabled as follows:
- Provision swap on the target worker nodes,
- Enable the NodeSwap feature gate on the kubelet,
- Set the --fail-swap-on flag to false, and
- (Optional) Allow Kubernetes workloads to use swap by setting MemorySwap.SwapBehavior=UnlimitedSwap in the kubelet config.
MemorySwap.SwapBehavior configures the swap memory available to container workloads. It may be one of:
- “LimitedSwap”: a workload’s combined memory and swap usage cannot exceed the pod memory limit.
- “UnlimitedSwap”: workloads can use unlimited swap, up to the allocatable limit.
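Putting the steps together, a node-setup sketch might look like the following. The swap file path and size are hypothetical, the kubelet config path varies by distribution, and everything here requires root; the KubeletConfiguration fields shown (failSwapOn, featureGates, memorySwap.swapBehavior) are the config-file equivalents of the flags described above:

```shell
# 1. Provision swap on the worker node (here: a hypothetical 2 GiB swap file).
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# 2. Enable the feature in the kubelet configuration
#    (path is distro-dependent; /var/lib/kubelet/config.yaml is common).
cat >> /var/lib/kubelet/config.yaml <<'EOF'
failSwapOn: false
featureGates:
  NodeSwap: true
memorySwap:
  swapBehavior: UnlimitedSwap
EOF

# 3. Restart the kubelet to pick up the new configuration.
systemctl restart kubelet
```

Omitting the memorySwap stanza leaves workloads without swap access (the first scenario above), while UnlimitedSwap corresponds to the second.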
If the feature flag is enabled, the user must still set --fail-swap-on=false to adjust the default behaviour. A node must have swap provisioned and available for this feature to work. If there is no swap available, but the feature flag is set to true, there will still be no change in existing behaviour.
The feature flag can be disabled while the --fail-swap-on=false flag is set, but this would result in undefined behaviour. To turn this off, the kubelet would need to be restarted. If a cluster admin wants to disable swap on the node without repartitioning it, they can stop the kubelet, run swapoff on the node, and restart the kubelet with --fail-swap-on=true. The setting of the feature flag will be ignored in this case.
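Assuming the kubelet runs under systemd (an assumption — service management varies by distribution), the disable procedure described above could be sketched as:

```shell
# Stop the kubelet before touching swap.
systemctl stop kubelet

# Deactivate all swap files and partitions without repartitioning the node.
swapoff -a

# Start the kubelet again with swap checking re-enabled; with
# --fail-swap-on=true in its configuration, the NodeSwap feature
# gate setting is ignored.
systemctl start kubelet
```

These commands require root and a node where the kubelet is managed as a systemd service.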
Having swap available on a system reduces predictability. Swap’s performance is worse than regular memory, sometimes by many orders of magnitude, which can cause unexpected performance regressions. Furthermore, swap changes a system’s behaviour under memory pressure, and applications cannot directly control what portions of their memory usage are swapped out. Since enabling swap permits greater memory usage for workloads in Kubernetes that cannot be predictably accounted for, it also increases the risk of noisy neighbours and unexpected packing configurations, as the scheduler cannot account for swap memory usage.
The performance of a node with swap memory enabled depends on the underlying physical storage. When swap memory is in use, performance will be significantly worse in an I/O operations per second (IOPS) constrained environment, such as a cloud VM with I/O throttling, when compared to faster storage mediums like solid-state drives or NVMe.
Use of swap is not recommended for certain performance-constrained workloads or environments. Cluster administrators and developers should benchmark their nodes and applications before using swap in production scenarios.
- The CRI has changed, so container runtimes (e.g. containerd, cri-o) won’t be able to actually accept swap configs until they have been updated against the new CRI.
- The work isn’t over with this release; there will be a multi-release graduation process. This feature won’t be graduating until at least 1.25.
Swap support in Kubernetes should appeal to a wide variety of users: cluster administrators and developers alike. This feature could be one more lever for reducing infrastructure costs for businesses. However, since the feature is still in alpha, it is not ready for production usage. Readers are encouraged to perform benchmarking tests to evaluate its suitability for their workloads/use cases and provide feedback to the K8s SIG Node WG via the following means:
SIG Node meets regularly and can be reached via Slack (channel #sig-node), or the SIG’s mailing list. Feel free to reach out to Elana Hashman (@ehashman on Slack and GitHub) if you’d like to help.