<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Serge Logvinov</title>
    <description>The latest articles on DEV Community by Serge Logvinov (@sergelogvinov).</description>
    <link>https://dev.to/sergelogvinov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2079373%2Fa1e9ae69-13d1-4473-b210-3b72f05e8193.jpg</url>
      <title>DEV Community: Serge Logvinov</title>
      <link>https://dev.to/sergelogvinov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sergelogvinov"/>
    <language>en</language>
    <item>
      <title>Proxmox Virtual Machine optimization - Deep Dive</title>
      <dc:creator>Serge Logvinov</dc:creator>
      <pubDate>Sun, 22 Feb 2026 09:07:23 +0000</pubDate>
      <link>https://dev.to/sergelogvinov/proxmox-virtual-machine-optimization-deep-dive-mn9</link>
      <guid>https://dev.to/sergelogvinov/proxmox-virtual-machine-optimization-deep-dive-mn9</guid>
      <description>&lt;p&gt;In the previous articles, I covered the basic VM settings you should configure by default in Proxmox VE.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/sergelogvinov/proxmox-hugepages-for-vms-1fh3"&gt;Proxmox hugepages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/sergelogvinov/proxmox-cpu-affinity-for-vms-4dhb"&gt;Proxmox CPU affinity&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, I’ll explain what actually happens under the hood and why proper CPU, NUMA, and interrupt configuration is critical for high-performance workloads, especially if you run latency-sensitive services or Kubernetes worker nodes.&lt;/p&gt;

&lt;h1&gt;
  CPU affinity is not enough
&lt;/h1&gt;

&lt;p&gt;When you configure a CPU affinity list in Proxmox:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The VM is restricted to a predefined set of physical CPU cores.&lt;/li&gt;
&lt;li&gt;All vCPUs are allowed to run on those cores.&lt;/li&gt;
&lt;li&gt;However, the hypervisor scheduler can still move individual vCPUs between the allowed cores.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you allow cores 0-7, vCPU 1 may run on core 0 now, then move to core 3, then to core 6.&lt;br&gt;
The VM expects predictable CPU behavior, especially for workloads like databases, networking services, or Kubernetes nodes, which have their own optimizations based on CPU cache and latency. Moving vCPUs between physical cores can cause unpredictable performance degradation.&lt;/p&gt;

&lt;p&gt;To achieve stable performance, you need to ensure that each vCPU of the VM is pinned to a specific physical CPU core.&lt;br&gt;
Proxmox does not automatically pin each vCPU one by one.&lt;/p&gt;

&lt;p&gt;You must configure this explicitly with scripts, or automate it (see the solution below).&lt;/p&gt;
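&lt;p&gt;As a rough sketch of what such a script does (the VM id and core list are example values; QEMU names its vCPU threads &lt;code&gt;CPU n/KVM&lt;/code&gt;, and Proxmox stores the process id in &lt;code&gt;/var/run/qemu-server/&lt;/code&gt;):&lt;/p&gt;

```shell
#!/bin/sh
# Pin each vCPU thread of a VM to one physical core (example values)
VMID=100
CORES="2 3 4 5"                               # one physical core per vCPU
PID=$(cat /var/run/qemu-server/${VMID}.pid)

i=1
for TID in $(ps -T -p "$PID" -o spid=,comm= | awk '/CPU.*KVM/ {print $1}'); do
  CORE=$(echo $CORES | cut -d' ' -f$i)        # pick the i-th core from the list
  taskset -pc "$CORE" "$TID"                  # bind this vCPU thread to that core
  i=$((i+1))
done
```

This is only a sketch of the idea; the scheduler described in the Solution section automates it.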
&lt;h1&gt;
  Memory NUMA nodes: avoid cross-node memory access
&lt;/h1&gt;

&lt;p&gt;Modern servers use a NUMA (Non-Uniform Memory Access) architecture.&lt;/p&gt;

&lt;p&gt;If NUMA is not configured correctly, Proxmox may allocate memory across multiple NUMA nodes, and the VM may have to access memory from different NUMA nodes. This results in increased latency and cross-node (socket) memory access, which can significantly degrade performance.&lt;/p&gt;

&lt;p&gt;To avoid this, you need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;identify which physical cores belong to each NUMA node.&lt;/li&gt;
&lt;li&gt;define VM CPU affinity within a single NUMA node.&lt;/li&gt;
&lt;li&gt;configure VM NUMA settings to match that node.&lt;/li&gt;
&lt;/ul&gt;
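&lt;p&gt;Standard Linux tools are enough to identify the topology (the output depends on your hardware):&lt;/p&gt;

```shell
# Show NUMA nodes with their CPUs and memory sizes
numactl --hardware
# Per-CPU view: NUMA node, socket, core, and cache assignment
lscpu -e=CPU,NODE,SOCKET,CORE,CACHE
```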

&lt;p&gt;Why is using a single NUMA node per VM the best strategy? Because QEMU does not expose the host CPU topology to the guest by default.&lt;br&gt;
You need to pass extra QEMU arguments to provide it.&lt;/p&gt;

&lt;p&gt;If the VM’s CPU cores and threads come from a single NUMA node, these arguments are enough:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-cpu 'host,topoext=on,host-cache-info=on' -smp '4,sockets=1,cores=2,threads=2,maxcpus=4'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do not forget to set cores and threads according to your VM CPU configuration.&lt;/p&gt;

&lt;h1&gt;
  SR-IOV devices
&lt;/h1&gt;

&lt;p&gt;Each hardware device uses interrupts to notify the Linux kernel that it has data to process.&lt;br&gt;
If the interrupt is handled by a different CPU core than the one running the VM’s vCPU, several problems may occur:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU cache misses&lt;/li&gt;
&lt;li&gt;cross-core synchronization overhead&lt;/li&gt;
&lt;li&gt;increased memory traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To solve this problem, set the hardware interrupt affinity list to match the VM’s CPU affinity list.&lt;/p&gt;
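&lt;p&gt;A sketch of doing this by hand (the interface name and IRQ number are placeholders; look up the real ones in &lt;code&gt;/proc/interrupts&lt;/code&gt; on your host):&lt;/p&gt;

```shell
# Find the IRQs that belong to the SR-IOV virtual function
grep 'enp1s0f0v0' /proc/interrupts
# Pin IRQ 120 (example) to the same cores the VM is pinned to (2-5)
echo 2-5 > /proc/irq/120/smp_affinity_list
```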
&lt;h1&gt;
  Solution
&lt;/h1&gt;

&lt;p&gt;To automate all of this, I created an open-source component as part of &lt;a href="https://github.com/sergelogvinov/karpenter-provider-proxmox" rel="noopener noreferrer"&gt;Karpenter for Proxmox&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The Proxmox Scheduler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;observes running VMs&lt;/li&gt;
&lt;li&gt;reads their CPU affinity configuration&lt;/li&gt;
&lt;li&gt;pins each vCPU to a specific host core&lt;/li&gt;
&lt;li&gt;sets correct interrupt affinity for SR-IOV devices&lt;/li&gt;
&lt;li&gt;optionally optimizes the CPU frequency governor for power consumption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is distributed as a &lt;code&gt;deb&lt;/code&gt; package and can be installed on the Proxmox host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dpkg &lt;span class="nt"&gt;-i&lt;/span&gt; https://github.com/sergelogvinov/karpenter-provider-proxmox/releases/download/v0.10.1/proxmox-scheduler_0.10.1_linux_amd64.deb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also optimize power usage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;set CPU governor to performance for cores used by VMs&lt;/li&gt;
&lt;li&gt;set CPU governor to powersave for unused cores&lt;/li&gt;
&lt;/ul&gt;
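&lt;p&gt;Manually, this maps to the Linux cpufreq sysfs interface (core numbers here are examples; the scheduler package automates the assignment):&lt;/p&gt;

```shell
# performance governor for a core that runs pinned vCPUs
echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
# powersave governor for an idle core
echo powersave > /sys/devices/system/cpu/cpu10/cpufreq/scaling_governor
```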

&lt;h2&gt;
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/sergelogvinov/karpenter-provider-proxmox/blob/main/docs/scheduler.md" rel="noopener noreferrer"&gt;https://github.com/sergelogvinov/karpenter-provider-proxmox/blob/main/docs/scheduler.md&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sergelogvinov/terraform-proxmox-template-nodegroup/blob/main/main.tf#L34" rel="noopener noreferrer"&gt;https://github.com/sergelogvinov/terraform-proxmox-template-nodegroup/blob/main/main.tf#L34&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://forum.proxmox.com/threads/hey-proxmox-community-lets-talk-about-resources-isolation.124256/" rel="noopener noreferrer"&gt;https://forum.proxmox.com/threads/hey-proxmox-community-lets-talk-about-resources-isolation.124256/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>proxmox</category>
      <category>linux</category>
    </item>
    <item>
      <title>Do you need a free-tier to learn Kubernetes?</title>
      <dc:creator>Serge Logvinov</dc:creator>
      <pubDate>Tue, 30 Dec 2025 02:40:59 +0000</pubDate>
      <link>https://dev.to/sergelogvinov/do-you-need-a-free-tier-to-learn-kubernetes-54pg</link>
      <guid>https://dev.to/sergelogvinov/do-you-need-a-free-tier-to-learn-kubernetes-54pg</guid>
      <description>&lt;p&gt;&lt;em&gt;This is my opinion, based on my experience and on job interviews with candidates.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It is very important to understand the core components of Kubernetes — what happens when you create a new cluster and what is happening behind the scenes.&lt;br&gt;
This knowledge will help you troubleshoot issues in the future. In such cases, I do not believe ChatGPT can fully help if you do not already understand the fundamentals.&lt;/p&gt;

&lt;p&gt;Pods, Deployments, Services, and other basic components are relatively easy to learn through video courses or ChatGPT explanations.&lt;/p&gt;

&lt;p&gt;Since most Kubernetes clusters run in the cloud, it’s important to understand how Kubernetes communicates with a cloud provider.&lt;br&gt;
Bootstrap your own simple cloud: Proxmox is a good, vendor-lock-in-free alternative.&lt;/p&gt;

&lt;p&gt;One Proxmox node is enough to get started; 4 CPUs and 16 GB of RAM are sufficient to run two virtual machines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one Control-plane node&lt;/li&gt;
&lt;li&gt;one Worker node&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You will delete and recreate the worker node many times. This process will help you understand how the cluster works.&lt;/p&gt;

&lt;p&gt;Use a well-known Kubernetes distribution like Talos. It is easy to install, and there are many examples on GitHub showing how to set it up.&lt;br&gt;
Talos provides a dedicated image for Proxmox, so you can use it without issues by following the official documentation.&lt;/p&gt;

&lt;p&gt;Most Kubernetes clusters include these components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CCM - cloud controller manager&lt;/li&gt;
&lt;li&gt;CNI - container network interface&lt;/li&gt;
&lt;li&gt;CSI - container storage interface&lt;/li&gt;
&lt;li&gt;Node automation - tools like Cluster Autoscaler or Karpenter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In well-known cloud providers, some components are configured for you by default, such as the CCM and CNI.&lt;br&gt;
All other components usually need to be installed manually.&lt;br&gt;
In a home lab, you need to install all components yourself.&lt;br&gt;
This helps you better understand how Kubernetes works in a cloud environment.&lt;/p&gt;

&lt;p&gt;In a Proxmox installation, all required components already exist on the internet, mainly on GitHub.&lt;br&gt;
Today, self-hosted solutions based on Proxmox can give you almost the same experience as public clouds, while helping you build strong knowledge of core Kubernetes components.&lt;br&gt;
After that, switching to a public cloud will be much easier for you.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Install Proxmox CCM, Proxmox CSI, and Karpenter. The CNI is already included in the Talos distribution.&lt;br&gt;
Do everything manually. Write down all the steps and save them in your GitHub repository.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Play with your cluster, break things and fix them again.&lt;br&gt;
Deploy simple applications, scale them up and down, and monitor resource usage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Try to automate the installation using Terraform.&lt;br&gt;
Save the code in your GitHub repository and create step-by-step instructions explaining how to deploy it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then use GitOps best practices to manage your cluster with Argo CD or Flux CD.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
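&lt;p&gt;As an illustrative sketch of step 1 (the chart names and registry paths below are assumptions, not verified commands; check each project’s README for the authoritative installation instructions):&lt;/p&gt;

```shell
# Hypothetical Helm installs; verify chart names and registries in the upstream docs
helm upgrade -i -n kube-system proxmox-cloud-controller-manager \
  oci://ghcr.io/sergelogvinov/charts/proxmox-cloud-controller-manager
helm upgrade -i -n csi-proxmox --create-namespace proxmox-csi-plugin \
  oci://ghcr.io/sergelogvinov/charts/proxmox-csi-plugin
```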

&lt;p&gt;Just remember, all these steps already exist on the internet. You can search for them on Google or ask ChatGPT.&lt;/p&gt;

&lt;p&gt;This experience will give you strong advantages in future job interviews, even if the company uses a different cloud provider.&lt;br&gt;
Do not forget to mention that you have your own home lab and share links to your GitHub repositories.&lt;/p&gt;

&lt;p&gt;Certificates show only that you know the right answers, but your home lab shows that you can do more than that.&lt;/p&gt;

&lt;p&gt;Good luck!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>proxmox</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes on Hybrid Cloud: Talos network</title>
      <dc:creator>Serge Logvinov</dc:creator>
      <pubDate>Tue, 14 Jan 2025 12:08:29 +0000</pubDate>
      <link>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-talos-network-51lo</link>
      <guid>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-talos-network-51lo</guid>
      <description>&lt;p&gt;The network management is an important part of a Kubernetes cluster, especially in hybrid and multi-cloud environments. The stability and predictability of the network are very important for the applications running on the cluster. The network is usually more stable in one physical location than in a cross-cloud environment.&lt;/p&gt;

&lt;p&gt;Several basic factors can impact the stability of the application:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DNS resolving&lt;/li&gt;
&lt;li&gt;Network stability&lt;/li&gt;
&lt;li&gt;Network latency&lt;/li&gt;
&lt;li&gt;Network bandwidth&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  DNS resolving
&lt;/h1&gt;

&lt;p&gt;The application needs to resolve DNS names to IP addresses. By default, a Kubernetes cluster uses CoreDNS as its DNS server. CoreDNS is deployed as a Kubernetes deployment and can be scaled up or down. However, if the CoreDNS pods are very far from the application pod, latency may increase, and DNS names might fail to resolve.&lt;/p&gt;

&lt;p&gt;To solve this issue, use a DaemonSet to deploy CoreDNS on each node. Additionally, set the TrafficPolicy for the CoreDNS service to Local &lt;a href="https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-service-traffic-topology-and-routing-3gle"&gt;Service traffic topology and routing&lt;/a&gt;. The DNS traffic will stay within the node, keeping the latency very low.&lt;/p&gt;
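&lt;p&gt;A minimal sketch of that service change (assuming the standard &lt;code&gt;kube-dns&lt;/code&gt; service name in &lt;code&gt;kube-system&lt;/code&gt;; the &lt;code&gt;internalTrafficPolicy&lt;/code&gt; field is available since Kubernetes 1.22):&lt;/p&gt;

```shell
# Keep DNS lookups on the local node
kubectl -n kube-system patch service kube-dns \
  -p '{"spec":{"internalTrafficPolicy":"Local"}}'
```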

&lt;h1&gt;
  Network stability
&lt;/h1&gt;

&lt;p&gt;For kubelet and kube-proxy, network stability is crucial. These components communicate with the Kubernetes API server to configure the network and run the pods. The kubelet also updates the status of the pods and node. If the status is not updated regularly, the Kubernetes API can mark the node as unhealthy, and the pods may be rescheduled to another node.&lt;/p&gt;

&lt;p&gt;Imagine a situation where the pods and network are working fine, but the kubelet loses connection to the API server (for example, if the Kubernetes API load balancer goes down). Kubernetes will create new copies of the pods on another node, and the old pods will be terminated once the kubelet reconnects to the API server. For stateless applications, this behavior is usually not a problem. However, for stateful applications, like databases, it can cause significant issues.&lt;/p&gt;

&lt;p&gt;Talos solves this problem by using an embedded load balancer on each node. The kubelet and kube-proxy (or CNI plugins) connect to the local load balancer, which forwards traffic to the API server. This ensures consistent connectivity and helps avoid unnecessary disruptions caused by API server load balancer failures.&lt;/p&gt;

&lt;p&gt;You can enable it in the machine configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;features&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubePrism&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;7445&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this configuration, the Kubernetes API server becomes accessible on each node at the localhost address on port 7445.&lt;/p&gt;

&lt;h1&gt;
  Network latency and bandwidth
&lt;/h1&gt;

&lt;p&gt;The best way to reduce network latency is to use native network routing. However, in hybrid and multi-cloud environments, this is not possible. The CNI (Container Network Interface) provides network overlays to address this issue, using technologies like VXLAN, GRE, or WireGuard. In all these cases, the network overlay adds an additional header to the packets, increasing network latency and reducing network bandwidth.&lt;/p&gt;

&lt;p&gt;Talos includes an embedded network mesh based on WireGuard, a fast and secure VPN protocol that encrypts traffic between nodes. Regardless of where the nodes are located or whether they are behind NAT, the nodes can communicate with each other seamlessly.&lt;/p&gt;

&lt;p&gt;However, since this mesh is an additional component in the network stack, it can introduce latency and some instability, and the recovery process in case of issues can be slow.&lt;/p&gt;

&lt;p&gt;The network mesh can be enabled in the machine configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubespan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;cluster&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To reduce the recovery time, you can set filters to limit the IP addresses that can be used to create the tunnels. By specifying these filters, you ensure that the network mesh uses only specific IP ranges:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubespan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0/0&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;::/0'&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;!192.168.0.0/16'&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;!172.16.0.0/12'&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;!10.0.0.0/8'&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;!fd00::/8'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or the opposite case: if you have both public and private networks and want to use only the private network for the mesh (because the public network is slower and more expensive), you can configure the network mesh to use the private network exclusively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubespan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;filters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;endpoints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;192.168.0.0/16'&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;172.16.0.0/12'&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;10.0.0.0/8'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to establish a mesh network only between datacenters while using the native network for communication between nodes within each datacenter, consider using &lt;a href="https://kilo.squat.ai/" rel="noopener noreferrer"&gt;kilo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Kilo can be deployed as a CNI plugin that creates a WireGuard-based mesh network across Kubernetes zones, regions, and datacenters. It allows efficient and secure connectivity between nodes in different datacenters while maintaining native networking within each datacenter. This hybrid approach can optimize performance by reducing latency and overhead for intra-datacenter traffic while ensuring secure and reliable communication between datacenters.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>hybridcloud</category>
      <category>network</category>
    </item>
    <item>
      <title>Kubernetes on Hybrid Cloud: Bare-metal or Hypervisor</title>
      <dc:creator>Serge Logvinov</dc:creator>
      <pubDate>Sun, 12 Jan 2025 08:22:31 +0000</pubDate>
      <link>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-bare-metal-or-hypervisor-a0p</link>
      <guid>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-bare-metal-or-hypervisor-a0p</guid>
      <description>&lt;p&gt;It is a very popular question: What is the best choice to deploy a Kubernetes node - directly on bare metal or by setting up a hypervisor first and then deploying the Kubernetes node on VMs? There is no single correct answer: it depends on your requirements and the power of your hardware.&lt;/p&gt;

&lt;p&gt;Let's compare both solutions.&lt;/p&gt;

&lt;h1&gt;
  Bare-metal installation
&lt;/h1&gt;

&lt;p&gt;If you have a small bare-metal server with 32 cores (or fewer) and 64-128 GB of RAM, the best choice is to install the Kubernetes node directly on the server. By default, the kubelet limits a node to 110 pods. If your workload requires a larger number of pods with small resource requirements, consider adjusting the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allocate a node subnet with more than 256 IPs (the default /24 subnet has 256 IPs)&lt;/li&gt;
&lt;li&gt;Reserve more resources for the kubelet, since each pod adds kubelet overhead&lt;/li&gt;
&lt;li&gt;Adjust the secrets/ConfigMaps sync period&lt;/li&gt;
&lt;li&gt;Check the NUMA and L3 cache architecture of the server&lt;/li&gt;
&lt;li&gt;Run special DaemonSets to manage power, monitor hardware, and provide other features&lt;/li&gt;
&lt;/ul&gt;
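&lt;p&gt;For example, raising the pod limit is a standard KubeletConfiguration setting (a sketch; in Talos the same fragment goes under &lt;code&gt;machine.kubelet.extraConfig&lt;/code&gt;):&lt;/p&gt;

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
maxPods: 250          # default is 110; make sure the pod subnet has enough IPs
```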

&lt;h1&gt;
  Hypervisor installation
&lt;/h1&gt;

&lt;p&gt;Imagine you have a powerful server with 128 cores and 512 GB of RAM or more, and you deploy a single Kubernetes node on it. During maintenance, you lose all of the server’s capacity. If you use a hypervisor instead, you can create virtual machines (VMs) and maintain them one by one while the rest keep running. Also, the Linux kernel and applications are not designed to work efficiently with such a large amount of RAM and CPU cores: the kernel spends time managing the TLB, NUMA, and other CPU features. Splitting the server into VMs can help solve many of these problems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each VM can use a separate NUMA node with its own memory and L3 cache, see &lt;a href="https://dev.to/sergelogvinov/proxmox-cpu-affinity-for-vms-4dhb"&gt;CPU Affinity&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;You can maintain the VMs separately without losing the whole server’s capacity; the hypervisor kernel does little work and is usually very stable, so restarting the whole server is rarely required&lt;/li&gt;
&lt;li&gt;You can use hypervisor features like snapshots, dynamic disk attachment, and resizing&lt;/li&gt;
&lt;li&gt;Kubernetes resource isolation: you can deploy different workloads on different VMs&lt;/li&gt;
&lt;li&gt;Simplified node management: adding or removing a Kubernetes node becomes easier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Well-known open-source hypervisors include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.proxmox.com/en/products/proxmox-virtual-environment/overview" rel="noopener noreferrer"&gt;Proxmox VE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/cloud-hypervisor/cloud-hypervisor" rel="noopener noreferrer"&gt;Cloud-hypervisor&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using Proxmox as your hypervisor for Kubernetes can provide a cloud-like experience similar to that of well-known cloud providers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dynamic disk attachment, resizing (PV, PVC)&lt;/li&gt;
&lt;li&gt;Network load balancing, firewall, and other features&lt;/li&gt;
&lt;li&gt;Cluster bootstrapping with Terraform&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>hybridcloud</category>
    </item>
    <item>
      <title>Kubernetes on Hybrid Cloud: Persistent storages</title>
      <dc:creator>Serge Logvinov</dc:creator>
      <pubDate>Sat, 11 Jan 2025 07:07:24 +0000</pubDate>
      <link>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-persistent-storages-4c0o</link>
      <guid>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-persistent-storages-4c0o</guid>
      <description>&lt;p&gt;Persistent storage is a key component of any production-grade Kubernetes deployment.&lt;/p&gt;

&lt;p&gt;It is very challenging to rely on just one storage solution for all Kubernetes nodes, especially in hybrid cloud environments. In such cases, we can group the nodes based on their cloud or hypervisor type and use the storage solution that works best for each group. For example, one group of nodes might use a cloud provider's managed storage, while another might rely on local or on-premises storage.&lt;/p&gt;

&lt;p&gt;However, many Kubernetes resources or operators are not designed to work seamlessly with multiple types of storage classes. For example, the StatefulSet resource requires a single type of PersistentVolumeClaim (PVC) template to create PVCs during the scale-up process. This means all PVCs created by the StatefulSet will use the same storage class as defined in the template. Once the PVC is created, changing the storage class is not straightforward (technically, it is possible, but it involves a complex process and risks downtime).&lt;/p&gt;

&lt;h1&gt;
  StatefulSet and PersistentVolumeClaim
&lt;/h1&gt;

&lt;p&gt;How can you use different storage classes in one StatefulSet?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StatefulSet&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="s"&gt;...&lt;/span&gt;
  &lt;span class="s"&gt;volumeClaimTemplates&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;storage&lt;/span&gt;
      &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ReadWriteOnce"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
        &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;storage-class-1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's assume we have two storage classes: &lt;code&gt;storage-class-1&lt;/code&gt; and &lt;code&gt;storage-class-2&lt;/code&gt;. We want the first pod to use &lt;code&gt;storage-class-1&lt;/code&gt; and the second pod to use &lt;code&gt;storage-class-2&lt;/code&gt; in a single StatefulSet deployment.&lt;/p&gt;

&lt;p&gt;By default, all PersistentVolumeClaims (PVCs) created by the StatefulSet will use &lt;code&gt;storage-class-1&lt;/code&gt; if it is specified in the PVC template. To assign &lt;code&gt;storage-class-2&lt;/code&gt; to the second pod, we need to manually create a PVC with &lt;code&gt;storage-class-2&lt;/code&gt; before scaling up the StatefulSet. After creating this PVC, we can scale up the StatefulSet, and the second pod will automatically use the pre-created PVC with &lt;code&gt;storage-class-2&lt;/code&gt;.&lt;/p&gt;
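&lt;p&gt;A sketch of the pre-created claim: StatefulSet PVC names follow the pattern templateName-statefulSetName-ordinal, so for the second pod of the &lt;code&gt;test&lt;/code&gt; StatefulSet above the name must be &lt;code&gt;storage-test-1&lt;/code&gt;:&lt;/p&gt;

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: storage-test-1        # templateName-statefulSetName-ordinal, ordinal 1 = second pod
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 1Gi
  storageClassName: "storage-class-2"
```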

&lt;p&gt;Additionally, it is possible to permanently change the storage class of PVCs after the StatefulSet has been created. However, this process is not straightforward and typically involves detaching the workload, editing the PVCs, and carefully managing the transition to avoid downtime.&lt;/p&gt;

&lt;p&gt;Here is how to change the storage class of a StatefulSet without downtime:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete the statefulSet resources without deleting the pods&lt;/span&gt;
kubectl delete statefulset &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;--cascade&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;orphan
&lt;span class="c"&gt;# Change the storage class in statefulset yaml, apply it and scale up the statefulSet&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; statefulset.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The flag &lt;code&gt;--cascade=orphan&lt;/code&gt; is very important because it will not delete the pods, and workloads will not be interrupted. After applying the new statefulSet, the new pods will be created with the new storage class.&lt;/p&gt;

&lt;h1&gt;
  Operator and PersistentVolumeClaim
&lt;/h1&gt;

&lt;p&gt;Using an operator, we can also configure different storage classes, similar to what we did with the StatefulSet. However, there is an additional step involved when working with operators. Before making any changes to the storage classes, we need to scale down the operator. This is because the operator continuously manages and reconciles the resources it controls, and it may revert any manual changes we make to align with its predefined configurations.&lt;/p&gt;

&lt;p&gt;Once the operator is scaled down, we can manually create or modify the PVCs to use different storage classes. After completing the changes, we can safely scale the operator back up to resume its management tasks.&lt;/p&gt;

&lt;h1&gt;
  Dynamic PersistentVolumeClaims
&lt;/h1&gt;

&lt;p&gt;All of the above solutions have a major drawback: they require us to know the exact name of the PVC that will be created by the StatefulSet or the operator. This can make the process hard and error-prone, especially in dynamic or large-scale environments.&lt;/p&gt;

&lt;p&gt;A better way to solve this problem is to use a universal, dynamic approach for PVC creation. One solution is to leverage a special StorageClass resource. With this method, Kubernetes can automatically create PVCs and PVs with the appropriate storage class based on the cloud environment or node type. This approach removes the need for manual intervention or pre-defining PVC names and simplifies storage management in hybrid or multi-cloud environments. It ensures that storage provisioning is both seamless and adaptable to the underlying infrastructure.&lt;/p&gt;

&lt;p&gt;This solution is already available with the &lt;a href="https://github.com/sergelogvinov/hybrid-csi-plugin" rel="noopener noreferrer"&gt;Hybrid CSI&lt;/a&gt;. The Hybrid CSI plugin acts as a proxy between Kubernetes and existing CSI plugins. It does not have its own storage implementation or any specific knowledge of cloud providers. Instead, it forwards PVC creation requests to the appropriate CSI plugin during the PVC creation process.&lt;/p&gt;

&lt;p&gt;With the Hybrid CSI plugin, you can configure priorities for different types of CSI plugins. During PVC provisioning, the Hybrid CSI plugin will select and use the first available plugin based on the defined priorities. This makes it an efficient and flexible solution for managing storage across diverse environments without needing to hard-code storage configurations or manage complex PVC naming schemes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;storage.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;StorageClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hybrid&lt;/span&gt;
&lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;storageClasses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;proxmox,hcloud-volumes&lt;/span&gt;
&lt;span class="na"&gt;provisioner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;csi.hybrid.sinextra.dev&lt;/span&gt;
&lt;span class="na"&gt;allowVolumeExpansion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;reclaimPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Delete&lt;/span&gt;
&lt;span class="na"&gt;volumeBindingMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WaitForFirstConsumer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we have a StorageClass named &lt;code&gt;hybrid&lt;/code&gt;, which combines two underlying storage classes: &lt;code&gt;proxmox&lt;/code&gt; and &lt;code&gt;hcloud-volumes&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;proxmox&lt;/code&gt; storage class is used for the Proxmox hypervisor.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;hcloud-volumes&lt;/code&gt; storage class is designed for the Hetzner Cloud.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a pod is deployed, the Hybrid CSI plugin determines the underlying environment where the pod is running. If the pod is running on a Proxmox hypervisor, the plugin forwards the PVC creation request to the proxmox storage plugin. Conversely, if the pod is running on Hetzner Cloud, the plugin uses the Hcloud CSI plugin to handle the PVC creation.&lt;/p&gt;

&lt;h1&gt;
  
  
  Example
&lt;/h1&gt;

&lt;p&gt;We've deployed a StatefulSet with the &lt;code&gt;hybrid&lt;/code&gt; storage class and two replicas.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get pods,pvc
NAME         READY   STATUS    RESTARTS   AGE
pod/test-0   1/1     Running   0          31s
pod/test-1   1/1     Running   0          31s

NAME                                   STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/storage-test-0   Bound    pvc-64440564-75e9-4926-82ef-280f412b11ee   1Gi        RWO            hybrid         &amp;lt;&lt;span class="nb"&gt;unset&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;                 32s
persistentvolumeclaim/storage-test-1   Bound    pvc-811cc51e-9c9f-4476-92e1-37382b175e7f   10Gi       RWO            hybrid         &amp;lt;&lt;span class="nb"&gt;unset&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;                 32s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After deploying the StatefulSet, we can see that the PersistentVolumes (PVs) were created using the &lt;code&gt;proxmox&lt;/code&gt; and &lt;code&gt;hcloud-volumes&lt;/code&gt; storage classes. The volumes have different sizes because Hetzner Cloud enforces a minimum volume size of 10Gi, while Proxmox can provision the requested 1Gi.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; default get pv pvc-64440564-75e9-4926-82ef-280f412b11ee pvc-811cc51e-9c9f-4476-92e1-37382b175e7f
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                    STORAGECLASS     VOLUMEATTRIBUTESCLASS   REASON   AGE
pvc-64440564-75e9-4926-82ef-280f412b11ee   1Gi        RWO            Delete           Bound    default/storage-test-0   proxmox          &amp;lt;&lt;span class="nb"&gt;unset&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;                          84s
pvc-811cc51e-9c9f-4476-92e1-37382b175e7f   10Gi       RWO            Delete           Bound    default/storage-test-1   hcloud-volumes   &amp;lt;&lt;span class="nb"&gt;unset&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;                          81s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;The Hybrid CSI plugin is an excellent solution for hybrid cloud environments. It enables you to use different storage classes for different nodes without needing to modify StatefulSet or operator resources. The plugin simplifies storage management by dynamically forwarding PVC creation requests to the appropriate storage class based on the underlying infrastructure.&lt;/p&gt;

&lt;p&gt;This plugin is easy to use and requires no additional configuration once set up. You can even configure it as the default storage class for all your deployments, allowing you to manage storage seamlessly without worrying about specifying or switching between storage classes. This makes it a powerful and flexible tool for modern hybrid cloud Kubernetes deployments.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/sergelogvinov/hybrid-csi-plugin" rel="noopener noreferrer"&gt;Hybrid CSI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/" rel="noopener noreferrer"&gt;StatefulSet&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/storage/storage-classes/" rel="noopener noreferrer"&gt;StorageClass&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/storage/persistent-volumes/" rel="noopener noreferrer"&gt;Persistent Volume Claim&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>hybridcloud</category>
      <category>multicloud</category>
    </item>
    <item>
      <title>Kubernetes on Hybrid Cloud: Network design</title>
      <dc:creator>Serge Logvinov</dc:creator>
      <pubDate>Mon, 30 Dec 2024 11:35:58 +0000</pubDate>
      <link>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-network-design-3m9f</link>
      <guid>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-network-design-3m9f</guid>
      <description>&lt;p&gt;In hybrid cloud environments, network design is one of the most important and fundamental parts. It is a basic requirement for any type of Kubernetes cluster, whether it is a single-zone cluster or a multi-regional cluster. In this article, we will explain how to design a network for a hybrid cloud Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;Key Considerations for Network Design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensure reliable and secure connections between on-premises and cloud data centers.&lt;/li&gt;
&lt;li&gt;Optimize network latency.&lt;/li&gt;
&lt;li&gt;Plan IP address ranges to avoid conflicts between data centers.&lt;/li&gt;
&lt;li&gt;Use firewalls to control traffic between networks.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Communication inside the cluster
&lt;/h1&gt;

&lt;p&gt;Node-to-node:&lt;/p&gt;

&lt;p&gt;All nodes in the cluster must have connectivity to each other. Control plane nodes must have access to the kubelet on each node, and nodes must have access to the control plane nodes. If you are going to use Kubernetes admission controllers (webhooks), the control plane also needs access to the pods.&lt;/p&gt;

&lt;p&gt;Pod-to-pod:&lt;/p&gt;

&lt;p&gt;Pods must have access to the control plane nodes to reach the Kubernetes API. Monitoring systems must have access to all nodes to collect metrics and logs. Pods must be able to communicate with other pods within the same cluster. In many cases, pods also need access to the internet and cloud provider services. The latency between nodes, zones, and regions is a crucial factor. Application deployments should be aware of the network topology to optimize performance. Furthermore, when regions are located in different cloud providers, communication can become unstable and slower. Proper planning and optimization are required to address these challenges.&lt;/p&gt;

&lt;p&gt;In summary, all pods and nodes must have connectivity with each other. A hybrid Kubernetes cluster has multiple zones and regions, adding more complexity to network design.&lt;/p&gt;

&lt;h1&gt;
  
  
  Network design
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Hardware or SaaS VPN
&lt;/h2&gt;

&lt;p&gt;The well-known cloud providers offer VPN services. VPN links can be established between zones and regions, enabling secure and stable connections across different geographical locations. VPN connections can be hardware-based or software-based, depending on the infrastructure requirements. VPN helps in maintaining data security and encryption during transmission.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fxsyz9wges9d15kwx4r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fxsyz9wges9d15kwx4r.png" alt="Image description" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The red line represents the VPN connection between cloud providers. The VPN connection is encrypted and secure. However, this link uses public internet connections, which can sometimes be less reliable and have unstable bandwidth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Direct Connect
&lt;/h2&gt;

&lt;p&gt;Direct Connect is a dedicated network connection service provided by major cloud providers (e.g., AWS Direct Connect, Azure ExpressRoute, Google Cloud Interconnect). It allows organizations to establish a private, high-bandwidth connection between their own infrastructure and the cloud provider's network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flchd1chj73acng9ukl38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flchd1chj73acng9ukl38.png" alt="Image description" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The latency between cloud providers is very low, and bandwidth is predictable. This link uses dedicated fiber optic cables, which are more reliable than the public internet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mesh network
&lt;/h2&gt;

&lt;p&gt;A mesh network is a network topology where each node establishes a connection with every other node. Unlike traditional hierarchical networks, mesh networks are decentralized, allowing every node to cooperate in data distribution and routing. All connections are encrypted and secure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foh8su3zabt0ld4ix8auo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foh8su3zabt0ld4ix8auo.png" alt="Image description" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Known mesh solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://talos.dev" rel="noopener noreferrer"&gt;Talos Kubespan&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/cilium/cilium" rel="noopener noreferrer"&gt;Cilium&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/istio" rel="noopener noreferrer"&gt;Istio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/linkerd/linkerd2" rel="noopener noreferrer"&gt;Linkerd&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/squat/kilo" rel="noopener noreferrer"&gt;Kilo&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/tailscale/tailscale" rel="noopener noreferrer"&gt;TailScale&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Public IPv6 for Pods and Services
&lt;/h1&gt;

&lt;p&gt;A Kubernetes cluster supports IPv6 for both Pods and Services, allowing seamless communication across regions without requiring any special configuration for cross-region traffic. This native IPv6 support simplifies networking and ensures scalability for modern cloud-native workloads.&lt;/p&gt;

&lt;p&gt;However, enabling IPv6 alone is not enough to guarantee a secure communication environment. Ensure that your applications use mutual TLS (mTLS) or other security mechanisms to protect data in transit.&lt;/p&gt;

&lt;p&gt;See more details &lt;a href="https://dev.to/sergelogvinov/kubernetes-pods-with-global-ipv6-1aaj"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Note that not all cloud providers fully support IPv6 for Kubernetes Pods and Services. Check with your cloud provider for the latest information on IPv6 support.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>hybridcloud</category>
    </item>
    <item>
      <title>Talos on GCP with Spot Instances</title>
      <dc:creator>Serge Logvinov</dc:creator>
      <pubDate>Sat, 28 Dec 2024 12:35:37 +0000</pubDate>
      <link>https://dev.to/sergelogvinov/talos-on-gcp-with-spot-instances-4iln</link>
      <guid>https://dev.to/sergelogvinov/talos-on-gcp-with-spot-instances-4iln</guid>
      <description>&lt;p&gt;Using Spot Instances on Google Cloud Platform (GCP) is an excellent way to reduce infrastructure costs. This guide explains how to run Talos on GCP with Spot Instances effectively.&lt;/p&gt;

&lt;p&gt;One unique behavior of GCP Spot Instances is that they lose their IP address when they are preempted (stopped by Google). Most Container Network Interfaces (CNI) can handle this and update the node's IP address. However, if you're using IPv6 for your Pods, the CNI cannot update the CIDR (IP range) resources assigned to the node. This means that after preemption, you cannot run Pods on that node with proper IPv6 addresses.&lt;/p&gt;

&lt;p&gt;This guide explains how to work around this issue and run Talos on GCP with Spot Instances effectively.&lt;/p&gt;

&lt;h1&gt;
  
  
  Configurations
&lt;/h1&gt;

&lt;p&gt;Talos Cloud Controller Manager (CCM) is designed to work across different environments, including Google Cloud Platform (GCP). It can handle IP address changes when a node is preempted. To enable this functionality, you need to activate the &lt;code&gt;cloud-node-lifecycle&lt;/code&gt; controller in your Talos configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Helm values for Talos CCM&lt;/span&gt;
&lt;span class="na"&gt;enabledControllers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cloud-node&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;cloud-node-lifecycle&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;How It Works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Talos CCM watches for node eviction events.&lt;/li&gt;
&lt;li&gt;When a node is preempted, Talos CCM removes the node's resources from the cluster.&lt;/li&gt;
&lt;li&gt;When the node starts again, Talos CCM adds the updated node resources back to the cluster.&lt;/li&gt;
&lt;li&gt;For the CNI plugin, it will appear as if the node was replaced with a new one.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Additionally, Talos CCM marks Spot Instances with the label:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;node.cloudprovider.kubernetes.io/lifecycle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spot&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can use this label to schedule your Pods based on your specific requirements.&lt;/p&gt;
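&lt;p&gt;For example, you can list the nodes that received this label, or reference it in a &lt;code&gt;nodeSelector&lt;/code&gt; or node affinity rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Show all nodes that Talos CCM marked as spot instances
kubectl get nodes -l node.cloudprovider.kubernetes.io/lifecycle=spot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;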

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Running Talos on any cloud provider with Talos Cloud Controller Manager (CCM) is an excellent way to manage your Kubernetes cluster efficiently. Talos CCM can handle IP address changes when a node is preempted and helps you effectively manage Spot Instances for better resource optimization and cost savings.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://talos.dev" rel="noopener noreferrer"&gt;Talos&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/siderolabs/talos-cloud-controller-manager" rel="noopener noreferrer"&gt;Talos CCM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/sergelogvinov/kubernetes-pods-with-global-ipv6-1aaj"&gt;Public IPv6 for Pods&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>gcp</category>
    </item>
    <item>
      <title>Kubernetes on Hybrid Cloud: Talos Cloud Controller Manager (CCM)</title>
      <dc:creator>Serge Logvinov</dc:creator>
      <pubDate>Fri, 27 Dec 2024 12:34:40 +0000</pubDate>
      <link>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-talos-cloud-controller-manager-ccm-5bm7</link>
      <guid>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-talos-cloud-controller-manager-ccm-5bm7</guid>
      <description>&lt;p&gt;&lt;a href="https://talos.dev" rel="noopener noreferrer"&gt;Talos&lt;/a&gt; is a modern operating system designed specifically for Kubernetes. It supports various cloud providers, including AWS, Azure, Google Cloud, OpenStack, and on-premises environments. Talos focuses on security, simplicity, and ease of use. Because Talos nodes are aware of the cloud environment they are running in, the concept of Talos Cloud Controller Manager (CCM) was created&lt;/p&gt;

&lt;p&gt;The Talos Cloud Controller Manager (CCM) is built to work effectively in hybrid cloud environments. It collects information from Talos nodes and provides an easy and secure way to manage the lifecycle of Kubernetes nodes in the cluster.&lt;/p&gt;

&lt;p&gt;The components of Talos CCM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;cloud-node&lt;/code&gt;: Responsible for initializing and managing new nodes when they join a cluster.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;cloud-node-lifecycle&lt;/code&gt;: Talos CCM does not have credentials to call the cloud provider's API, which is why this component is not supported. Instead, run the cloud provider's native CCM with only the &lt;code&gt;cloud-node-lifecycle&lt;/code&gt; component enabled and all other components disabled.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;node-route-controller&lt;/code&gt;: Not supported.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;service-lb-controller&lt;/code&gt;: Not supported.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;node-ipam-controller&lt;/code&gt;: Manages IP addresses for the Pods in the cluster.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;node-csr-approval&lt;/code&gt;: Approves the certificate requests from the nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Controller cloud-node
&lt;/h1&gt;

&lt;p&gt;The cloud-node component is responsible for initializing and managing new nodes when they join a cluster. It performs tasks like: registering and attaching nodes to the Kubernetes cluster, adding important labels, and preparing the node for workloads.&lt;/p&gt;

&lt;p&gt;The important labels for hybrid environments are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;topology.kubernetes.io/region&lt;/code&gt; specifies the region of the node&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;topology.kubernetes.io/zone&lt;/code&gt; specifies the availability zone&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;node.kubernetes.io/instance-type&lt;/code&gt; specifies the instance type of the node&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;node.cloudprovider.kubernetes.io/platform&lt;/code&gt; specifies the cloud platform where the node is running&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These labels are essential for Kubernetes to place workloads (Pods) on the correct nodes based on regional and infrastructure requirements.&lt;/p&gt;
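&lt;p&gt;You can verify these labels on your nodes with &lt;code&gt;kubectl&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print the topology labels as extra columns
kubectl get nodes -L topology.kubernetes.io/region -L topology.kubernetes.io/zone -L node.kubernetes.io/instance-type
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;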

&lt;p&gt;In bare-metal environments or environments not recognized by Talos, you can predefine rules for the Cloud Controller Manager. These rules act as a configuration blueprint and are applied to nodes based on their metadata. &lt;a href="https://github.com/siderolabs/talos-cloud-controller-manager/blob/main/docs/config.md" rel="noopener noreferrer"&gt;Official documentation&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;transformations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# All rules are applied in order, all matched rules are applied to the node&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nocloud-nodes&lt;/span&gt;
    &lt;span class="c1"&gt;# Match nodes by nodeSelector&lt;/span&gt;
    &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matchExpressions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform           &amp;lt;- talos platform metadata variable case insensitive&lt;/span&gt;
            &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In            &amp;lt;- In, NotIn, Exists, DoesNotExist, Gt, Lt, Regexp&lt;/span&gt;
            &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                 &lt;span class="s"&gt;&amp;lt;- array of string values&lt;/span&gt;
              &lt;span class="s"&gt;- nocloud&lt;/span&gt;
    &lt;span class="c1"&gt;# Set labels for matched nodes&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;pvc-storage-class/name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-storage-class"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-nodes                 &amp;lt;- transformation name, optional&lt;/span&gt;
    &lt;span class="na"&gt;nodeSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Or condition for nodeSelector&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;matchExpressions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="c1"&gt;# And condition for matchExpressions&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform           &amp;lt;- talos platform metadata variable case insensitive&lt;/span&gt;
            &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In            &amp;lt;- In, NotIn, Exists, DoesNotExist, Gt, Lt, Regexp&lt;/span&gt;
            &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;                 &lt;span class="s"&gt;&amp;lt;- array of string values&lt;/span&gt;
              &lt;span class="s"&gt;- metal&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hostname&lt;/span&gt;
            &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Regexp&lt;/span&gt;
            &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;^web-[\w]+$         &amp;lt;- go regexp pattern&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Add label to the node, in this case, we add well-known node role label&lt;/span&gt;
      &lt;span class="na"&gt;node-role.kubernetes.io/web&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first rule applies to nodes running on the &lt;code&gt;nocloud&lt;/code&gt; platform, such as Proxmox or Oxide, and sets the &lt;code&gt;pvc-storage-class/name&lt;/code&gt; label on the node.&lt;br&gt;
The second rule targets the &lt;code&gt;metal&lt;/code&gt; platform with hostnames starting with &lt;code&gt;web-&lt;/code&gt; and sets the &lt;code&gt;node-role.kubernetes.io/web&lt;/code&gt; label, which defines the node role.&lt;/p&gt;

&lt;h1&gt;
  
  
  Controller cloud-node-lifecycle
&lt;/h1&gt;

&lt;p&gt;The &lt;code&gt;cloud-node-lifecycle&lt;/code&gt; component plays a crucial role in managing the lifecycle of Kubernetes nodes by interacting with the cloud provider's API. Its main responsibilities include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Verifying if a virtual machine (instance) is still running in the cloud provider's infrastructure.&lt;/li&gt;
&lt;li&gt;Ensuring the node is still registered and valid within the cloud environment.&lt;/li&gt;
&lt;li&gt;Monitoring the overall health and status of the node from the cloud provider's perspective.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, Talos CCM does not have credentials to access cloud provider APIs. To get the &lt;code&gt;cloud-node-lifecycle&lt;/code&gt; functionality, run the native CCM provided by your cloud provider with only that controller enabled. You can run one such CCM per cloud environment, sometimes with minor code changes.&lt;/p&gt;
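&lt;p&gt;Many out-of-tree CCMs are built on the common Kubernetes cloud-provider framework and accept a &lt;code&gt;--controllers&lt;/code&gt; flag. A sketch of running a provider's CCM in lifecycle-only mode might look like this (flag support varies between providers, so check your CCM's documentation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Enable only the lifecycle controller; everything else stays disabled
cloud-controller-manager \
  --cloud-provider=&amp;lt;provider&amp;gt; \
  --controllers=cloud-node-lifecycle
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;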

&lt;h1&gt;
  
  
  Controller node-route-controller
&lt;/h1&gt;

&lt;p&gt;Not supported.&lt;/p&gt;

&lt;h1&gt;
  
  
  Controller service-lb-controller
&lt;/h1&gt;

&lt;p&gt;Not supported.&lt;/p&gt;

&lt;h1&gt;
  
  
  Controller node-ipam-controller
&lt;/h1&gt;

&lt;p&gt;This controller makes sense only if you want globally routable IPv6 addresses for your Pods. Kubernetes already supports IPv6 for Pods, but only within a single network segment, whereas Talos CCM can handle as many segments as you have nodes. The controller allocates an IPv6 segment from the node's range and assigns addresses from it to each Pod. See a more detailed example in &lt;a href="https://dev.to/sergelogvinov/kubernetes-pods-with-global-ipv6-1aaj"&gt;Kubernetes Pods with global IPv6&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Controller node-csr-approval
&lt;/h1&gt;

&lt;p&gt;The Kubernetes API server must communicate with the Kubelet API to retrieve logs, metrics, and other node-level information. This communication happens over a secure channel, and the Kubelet API is protected by a TLS certificate. By default, when a Kubelet starts, it generates a self-signed certificate to enable TLS communication. However, the Kubernetes API server cannot inherently trust this self-signed certificate because there’s no verification from a Certificate Authority (CA).&lt;/p&gt;

&lt;p&gt;To address this issue, Kubernetes uses the &lt;code&gt;node-csr-approval&lt;/code&gt; controller. The process works as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a node starts, the Kubelet generates a Certificate Signing Request (CSR) and submits it to the Kubernetes API server.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;node-csr-approval&lt;/code&gt; controller validates the CSR to ensure it meets the security and policy requirements.&lt;/li&gt;
&lt;li&gt;If the CSR is valid, the controller approves it, and the Kubernetes Certificate Authority (CA) signs the certificate.&lt;/li&gt;
&lt;li&gt;The signed certificate is sent back to the Kubelet for use in secure communication.&lt;/li&gt;
&lt;/ul&gt;
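&lt;p&gt;Without an approver in place, kubelet serving certificate requests stay in the &lt;code&gt;Pending&lt;/code&gt; state and have to be approved by hand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List certificate signing requests and their state
kubectl get csr
# Approve a pending kubelet serving certificate manually
kubectl certificate approve &amp;lt;csr-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;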

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Whether you're running Talos in a hybrid cloud environment or in an on-premises setup, Talos CCM is an essential tool for managing your Kubernetes cluster efficiently.&lt;/p&gt;

&lt;p&gt;Talos CCM simplifies and streamlines the lifecycle management of nodes, automating key processes such as node initialization, labeling, and certificate management. It ensures that every node in your cluster is properly registered, labeled, and ready to serve workloads in a secure and consistent way.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;hybrid cloud deployments&lt;/code&gt;, Talos CCM makes it easier to integrate nodes across different environments, ensuring smooth communication and consistent behavior across the cluster, regardless of where your nodes are hosted. By handling these tasks automatically, Talos CCM reduces operational complexity, minimizes human error, and saves time for administrators. In short:&lt;/p&gt;

&lt;p&gt;If you're using Talos, using Talos CCM isn't just an option—it's a best practice.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-cloud-controller-manager-ccm-jdn"&gt;What is CCM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://talos.dev" rel="noopener noreferrer"&gt;Talos&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/siderolabs/talos-cloud-controller-manager" rel="noopener noreferrer"&gt;Talos Cloud Controller Manager&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sergelogvinov/gitops-examples" rel="noopener noreferrer"&gt;Deploy CCMs in one cluster example&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>hybridcloud</category>
    </item>
    <item>
      <title>Kubernetes on Hybrid Cloud: Cloud Controller Manager (CCM)</title>
      <dc:creator>Serge Logvinov</dc:creator>
      <pubDate>Thu, 26 Dec 2024 08:47:38 +0000</pubDate>
      <link>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-cloud-controller-manager-ccm-jdn</link>
      <guid>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-cloud-controller-manager-ccm-jdn</guid>
      <description>&lt;p&gt;Almost all Kubernetes services have cloud integration. You might not even know you're using it. For example, when you create a LoadBalancer service, Kubernetes will automatically create a cloud load balancer on the cloud provider's side for you. This is a great feature because you don’t need to worry about the cloud provider's API or how to set up an external load balancer.&lt;/p&gt;

&lt;p&gt;In a self-hosted Kubernetes cluster or a hybrid cloud environment, you need to deploy a Cloud Controller Manager (CCM) to enable cloud integration. The CCM is a Kubernetes component that communicates with the cloud provider's API to create and manage cloud resources. It serves as a bridge between Kubernetes and the cloud provider, allowing you to use features such as load balancers, block storage, and network routing.&lt;/p&gt;

&lt;p&gt;Here's what each key component of CCM does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;cloud-node&lt;/strong&gt;: Initializes new nodes during the scale-up process. It ensures nodes are correctly registered in the cluster with the necessary cloud-specific metadata.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cloud-node-lifecycle&lt;/strong&gt;: Removes nodes from the cluster during the scale-down process. It detects and handles unhealthy or deleted nodes based on information from the cloud provider.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;node-route-controller&lt;/strong&gt;: Creates and manages routes inside the cloud provider's network. It ensures nodes can communicate with each other across the cloud infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;service-lb-controller&lt;/strong&gt;: Creates and manages external load balancers for LoadBalancer type services in Kubernetes. It ensures external traffic can reach the appropriate pods through cloud-managed load balancers.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Controllers cloud-node and cloud-node-lifecycle
&lt;/h1&gt;

&lt;p&gt;The &lt;code&gt;cloud-node&lt;/code&gt; controller is responsible for initializing new nodes in the cluster. When a new node is added to the cluster, either by an auto-scaler or manually, the cloud-node controller communicates with the cloud provider's API to register and attach the node to the cluster. It also removes the initialization taint and adds labels to the node, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;topology.kubernetes.io/region&lt;/li&gt;
&lt;li&gt;topology.kubernetes.io/zone&lt;/li&gt;
&lt;li&gt;node.kubernetes.io/instance-type&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These labels help the Kubernetes scheduler place pods on the correct nodes. For example, you can configure a pod to run only in a specific region, zone, or cloud provider environment.&lt;/p&gt;
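&lt;p&gt;For example, you can pin a pod to a specific zone with a node selector on these labels. A minimal sketch (the zone value below is illustrative):&lt;/p&gt;

```yaml
# Pod pinned to one zone via the topology label set by the CCM.
apiVersion: v1
kind: Pod
metadata:
  name: zone-pinned-pod
spec:
  nodeSelector:
    topology.kubernetes.io/zone: eu-central-1a   # example zone value
  containers:
    - name: app
      image: nginx
```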

&lt;p&gt;The &lt;code&gt;cloud-node-lifecycle&lt;/code&gt; controller is responsible for removing nodes from the cluster. When a node is deleted—either manually, due to failure, or by an auto-scaler—the cloud-node-lifecycle controller ensures that the node is properly removed from the cluster's state. It also handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detaching cloud resources associated with the node.&lt;/li&gt;
&lt;li&gt;Ensuring all pods running on the removed node are rescheduled to other healthy nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These controllers work together to ensure node lifecycle management aligns with both the Kubernetes cluster state and the cloud provider's infrastructure.&lt;/p&gt;
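&lt;p&gt;Until the cloud-node controller has initialized a node, the node carries a well-known taint that keeps regular workloads off it. A freshly registered node (kubelet started with &lt;code&gt;--cloud-provider=external&lt;/code&gt;) looks roughly like this (the node name is illustrative):&lt;/p&gt;

```yaml
# Fragment of a Node object before cloud-node initialization finishes.
apiVersion: v1
kind: Node
metadata:
  name: worker-1                     # example node name
spec:
  taints:
    # Set by the kubelet when started with --cloud-provider=external;
    # removed by the cloud-node controller after successful initialization.
    - key: node.cloudprovider.kubernetes.io/uninitialized
      value: "true"
      effect: NoSchedule
```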

&lt;h1&gt;
  
  
  Controller node-route-controller
&lt;/h1&gt;

&lt;p&gt;The &lt;code&gt;node-route-controller&lt;/code&gt; is responsible for creating and managing network routes inside the cloud provider's network. When a new node is added to the cluster, the node-route-controller communicates with the cloud provider's API to create a route that directs traffic to the new node.&lt;/p&gt;

&lt;p&gt;How It Works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The controller ensures traffic can flow between nodes, even if they are spread across different availability zones.&lt;/li&gt;
&lt;li&gt;It is particularly useful when you want to route traffic from an external load balancer directly to the pods on the nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not all cloud providers support this feature. Support depends on the cloud provider's networking capabilities. It typically works best in single-cloud environments where routing is fully managed by the provider.&lt;/p&gt;

&lt;p&gt;In hybrid cloud environments, the node-route-controller introduces additional complexity because you need to manage pod-subnet routes between the on-premises infrastructure and the different cloud environments.&lt;/p&gt;

&lt;p&gt;This means configuring consistent routing tables across both environments to ensure traffic flows correctly. Misconfigurations or lack of routing support can cause network connectivity issues between pods across on-premises and cloud nodes.&lt;/p&gt;

&lt;p&gt;You can check the pod subnets and node IPs with the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pod CIDR&lt;/span&gt;
kubectl get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{range .items[*].spec}{.podCIDRs}{"\n"}{end}'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt;
&lt;span class="c"&gt;# Node IPs&lt;/span&gt;
kubectl get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{range .items[*].status.addresses[?(@.type=="InternalIP")]}{.address}{"\n"}{end}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output should look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Pod CIDR&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.32.7.0/24"&lt;/span&gt;,&lt;span class="s2"&gt;"fd00:10:32::7:0/112"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"10.32.10.0/24"&lt;/span&gt;,&lt;span class="s2"&gt;"fd00:10:32::a:0/112"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="c"&gt;# Node IPs&lt;/span&gt;
172.16.2.100
172.16.2.101
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a Kubernetes cluster, node IPs usually come from the same subnet, meaning all nodes are part of one network range. However, the pods on each node get their IP addresses from different subnets (Pod CIDRs). The &lt;code&gt;node-route-controller&lt;/code&gt; solves this by creating routes in the cloud provider's network. These routes tell the cloud network how to reach each pod subnet through its corresponding node.&lt;/p&gt;

&lt;p&gt;From the cloud provider's side it will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ip route add 10.32.7.0/24 gw 172.16.2.100
ip route add 10.32.10.0/24 gw 172.16.2.101
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the node-route-controller sets up the routes in the cloud provider's network, cloud services like LoadBalancer will know how to route traffic to the pods.&lt;/p&gt;

&lt;h1&gt;
  
  
  Controller service-lb-controller
&lt;/h1&gt;

&lt;p&gt;The &lt;code&gt;service-lb-controller&lt;/code&gt; is responsible for creating and managing load balancers in the cloud provider's network. When you create a LoadBalancer service in Kubernetes, the service-lb-controller will create a cloud load balancer through the cloud provider's API and route traffic to the service's pods, either through node ports or directly to the pods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;loadBalancerClass&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-cloud-provider&lt;/span&gt;
  &lt;span class="na"&gt;allocateLoadBalancerNodePorts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MyApp&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
  &lt;span class="na"&gt;clusterIP&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.0.171.239&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;
&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;loadBalancer&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;ip&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;192.0.2.127&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;allocateLoadBalancerNodePorts&lt;/code&gt; option tells the service-lb-controller to create a node port for the service in case the cloud provider does not support direct routing to the pods. The cloud provider creates a load balancer and routes traffic to the node port, which then forwards it to the service's pods. The default value is &lt;code&gt;true&lt;/code&gt;.&lt;/p&gt;
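&lt;p&gt;If your cloud load balancer can reach pod IPs directly (for example, through the routes created by the node-route-controller), you can turn node ports off. A minimal sketch:&lt;/p&gt;

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: LoadBalancer
  allocateLoadBalancerNodePorts: false   # requires direct routing to pod IPs
  selector:
    app: MyApp
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
```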

&lt;p&gt;The &lt;code&gt;loadBalancerClass&lt;/code&gt; option specifies which cloud provider's load balancer implementation to use. This is useful in hybrid cloud environments; otherwise, every CCM running a service-lb-controller could try to create a cloud load balancer for the same service in its own cloud. The default value is undefined.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;status.loadBalancer.ingress&lt;/code&gt; field shows the IP address of the cloud load balancer. You can use this IP address to access the service from outside the cluster. This IP usually belongs to a different subnet than your Kubernetes cluster's networks. It can be a public or a private IP address, depending on the cloud provider's configuration.&lt;/p&gt;

&lt;h1&gt;
  
  
  Hybrid Cloud Considerations
&lt;/h1&gt;

&lt;p&gt;When deploying a Cloud Controller Manager (CCM) in a hybrid cloud environment, there are several key points to keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most CCMs are designed to work only in their respective cloud environments. They assume full control over nodes, routes, and load balancers within their cloud infrastructure.&lt;/li&gt;
&lt;li&gt;LoadBalancer services are tightly integrated with the cloud provider's network. In a hybrid cloud environment, LoadBalancer services may not work properly, or their setup may require manual configuration and custom routing rules.&lt;/li&gt;
&lt;li&gt;The cloud provider's network may not be able to route traffic directly to pod IPs located in a different cloud or on-premises environment. You may need to manually configure custom routes or use dedicated networking solutions (e.g., VPNs, interconnects, or SD-WAN) to ensure connectivity.&lt;/li&gt;
&lt;li&gt;Running &lt;code&gt;two CCMs&lt;/code&gt; in a single Kubernetes cluster is tricky and can cause conflicts. Both CCMs may attempt to manage the same nodes, routes, or load balancers. Without clear separation of responsibilities, you might encounter unexpected behavior or resource management issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Recommendations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;node-route-controller&lt;/code&gt; and &lt;code&gt;service-lb-controller&lt;/code&gt; may not work as expected in a hybrid cloud environment. Simply disable them.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;cloud-node&lt;/code&gt; controller usually keeps working well. Nodes from one cloud provider may not initialize properly in another provider's environment, but the CCM specific to each environment will handle node initialization correctly.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;cloud-node-lifecycle&lt;/code&gt; controller may remove nodes that belong to another cloud provider, because it cannot find them on its own provider's side. This can cause healthy nodes to be removed unintentionally. To prevent this, the controller code needs changes so that it skips nodes it does not own.&lt;/li&gt;
&lt;/ol&gt;
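&lt;p&gt;In practice, disabling the route and load-balancer controllers comes down to the &lt;code&gt;--controllers&lt;/code&gt; flag of the cloud-controller-manager binary. A deployment fragment might look like this (the image and provider name are placeholders):&lt;/p&gt;

```yaml
# Fragment of a cloud-controller-manager container spec (names are placeholders).
containers:
  - name: cloud-controller-manager
    image: example.com/my-cloud-ccm:v1.31.0   # hypothetical image
    args:
      - --cloud-provider=my-cloud             # hypothetical provider name
      # Keep the node controllers, disable route and load-balancer controllers:
      - --controllers=*,-route,-service
```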

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;The Cloud Controller Manager (CCM) is a critical component for managing cloud resources, enabling dynamic scaling, and ensuring nodes are properly prepared in cloud environments. It acts as the bridge between Kubernetes and cloud infrastructure, automating resource management and simplifying cluster operations.&lt;/p&gt;

&lt;p&gt;However, in hybrid cloud environments, traditional CCMs face limitations due to provider-specific designs, network routing complexities, and conflicts between multiple CCMs. With a few key adjustments to the CCM code and configuration settings, you can use multiple cloud providers in a single Kubernetes cluster without compromising performance or reliability.&lt;/p&gt;

&lt;p&gt;You can find the CCM changes in my repositories. The core improvement involves adding a processing delay during node initialization in the CCM. One CCM gets sufficient time to fully initialize the node before another CCM takes any action. Once a node is properly initialized by its corresponding cloud provider's CCM, the other CCMs will not interfere with the node's state.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/sergelogvinov/cloud-provider/tree/nodelifecycle-1.31" rel="noopener noreferrer"&gt;cloud-provider&lt;/a&gt; the base package for the cloud provider implementation&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sergelogvinov/containers" rel="noopener noreferrer"&gt;images with changes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/sergelogvinov/gitops-examples" rel="noopener noreferrer"&gt;fluxcd example to deploy them&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>hybridcloud</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Kubernetes on Hybrid Cloud: Service traffic topology and routing</title>
      <dc:creator>Serge Logvinov</dc:creator>
      <pubDate>Wed, 25 Dec 2024 10:28:27 +0000</pubDate>
      <link>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-service-traffic-topology-and-routing-3gle</link>
      <guid>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-service-traffic-topology-and-routing-3gle</guid>
      <description>&lt;p&gt;Kubernetes is a powerful tool for managing container workloads in data centers and cloud systems. However, using Kubernetes in a hybrid cloud (both on-premises and cloud) has challenges, especially with networking and traffic routing. By default, Kubernetes services use a &lt;code&gt;round-robin&lt;/code&gt; algorithm to share traffic between nodes. But this method is not always the best for hybrid clouds because nodes are spread across different locations.&lt;/p&gt;

&lt;p&gt;In this article, we will explain why traffic routing is important in a hybrid Kubernetes environment and share best practices for improving network traffic performance.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why Traffic Topology and Routing Matter
&lt;/h1&gt;

&lt;p&gt;In a hybrid Kubernetes environment, workloads run both in on-premises data centers and cloud providers. This setup can cause poor network traffic routing, leading to higher latency and lower performance. If your cloud costs depend on data transfer, this can also make your bills higher.&lt;/p&gt;

&lt;p&gt;By optimizing traffic routing, you can make sure that network traffic moves efficiently between nodes. Sometimes, traffic might not even need to leave the physical server, saving time and costs. This helps to reduce latency, improve performance, and lower your monthly expenses.&lt;/p&gt;

&lt;h1&gt;
  
  
  Internal Traffic Policy
&lt;/h1&gt;

&lt;p&gt;Kubernetes has an internal traffic policy that lets you control how traffic moves inside the cluster. By default, Kubernetes services use the &lt;code&gt;Cluster&lt;/code&gt; traffic policy. This policy sends traffic to any available endpoint within the cluster, no matter where it is located.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
    &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MyApp&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
        &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
    &lt;span class="na"&gt;internalTrafficPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Local&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;internalTrafficPolicy: Local&lt;/code&gt; setting makes sure that traffic only goes to endpoints on the same node. This is helpful for workloads needing low latency and high speed. However, it can also lead to dropped requests if the pods are not running on the same node.&lt;/p&gt;

&lt;p&gt;Production use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Ingress controllers&lt;/code&gt; (e.g., Traefik or Nginx) often use this policy to handle traffic from an external load balancer. The load balancer checks the Ingress controller pod and sends traffic to a healthy one.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DaemonSets&lt;/code&gt; which run on every node in the cluster, can also use this policy to communicate with other pods on the same node. A good example is &lt;code&gt;CoreDNS&lt;/code&gt;. Since all pods need to resolve DNS names, it makes sense to run CoreDNS on every node.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/concepts/services-networking/service-traffic-policy/" rel="noopener noreferrer"&gt;Official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Topology Aware Routing
&lt;/h1&gt;

&lt;p&gt;Kubernetes also supports topology-aware routing, which helps optimize network traffic based on the physical location of nodes. This feature lets you set rules for services to make sure traffic stays within the same region or availability zone whenever possible. This improves performance, reduces latency, and can help lower costs for cross-zone or cross-region traffic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;service.kubernetes.io/topology-mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Auto&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
    &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MyApp&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
        &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In &lt;code&gt;Auto&lt;/code&gt; mode, topology-aware routing will automatically send traffic to the nearest endpoint in the current zone. This works only if each zone has at least one endpoint. If one or more zones have no endpoints, traffic will follow the default &lt;code&gt;round-robin&lt;/code&gt; routing instead.&lt;/p&gt;

&lt;p&gt;So, the number of endpoints should be equal to or greater than the number of zones to ensure proper routing. I do &lt;code&gt;not recommend&lt;/code&gt; using this method for hybrid clouds. In many cases, you might not have endpoints in every zone. This can cause traffic to fall back to round-robin routing, which may not be optimal for performance or cost in a hybrid setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/concepts/services-networking/topology-aware-routing/" rel="noopener noreferrer"&gt;Official documentation&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Traffic distribution
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
    &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;MyApp&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
        &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
    &lt;span class="na"&gt;trafficDistribution&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PreferClose&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;trafficDistribution: PreferClose&lt;/code&gt; setting helps to send traffic to the closest endpoint in the zone. If there is no endpoint in the current zone, traffic will be sent to other zones using round-robin routing. This method does not require an endpoint in every zone, which makes it more suitable for hybrid cloud environments.&lt;/p&gt;

&lt;p&gt;Production use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;Ingress controllers&lt;/code&gt;: to serve traffic to the closest endpoint in the same zone for better performance.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Cache services&lt;/code&gt; (e.g., Redis or Memcached): to ensure faster access by reaching the nearest endpoint.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Pod-to-service&lt;/code&gt; communication: in microservices architectures, traffic between pods and services can use nearby endpoints to reduce latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution" rel="noopener noreferrer"&gt;Official documentation&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;In a hybrid Kubernetes environment, traffic routing is crucial for optimizing network performance and reducing costs. By using &lt;code&gt;internal traffic policies&lt;/code&gt; and &lt;code&gt;traffic distribution&lt;/code&gt; settings, you can make sure that traffic moves efficiently between nodes. This helps to reduce latency, improve performance, and lower your monthly expenses.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>hybridcloud</category>
    </item>
    <item>
      <title>Kubernetes on Hybrid Cloud dream or reality?</title>
      <dc:creator>Serge Logvinov</dc:creator>
      <pubDate>Tue, 24 Dec 2024 13:03:11 +0000</pubDate>
      <link>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-dream-or-reality-59bg</link>
      <guid>https://dev.to/sergelogvinov/kubernetes-on-hybrid-cloud-dream-or-reality-59bg</guid>
      <description>&lt;p&gt;Many organizations are interested in running Kubernetes on a hybrid cloud setup. This usually becomes a priority after major outages in their data centers or receiving unexpectedly high bills from their cloud provider. The goal is to leverage the flexibility and scalability of Kubernetes while avoiding vendor lock-in.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Hybrid Cloud
&lt;/h2&gt;

&lt;p&gt;Running Kubernetes on a hybrid cloud means operating a single Kubernetes cluster across on-premises infrastructure and public cloud environments. Public clouds might include well-known providers such as AWS, Azure, or Google Cloud, while private clouds could involve platforms like OpenStack, Proxmox, or even bare-metal servers.&lt;/p&gt;

&lt;p&gt;Benefits of Hybrid Cloud Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cost Efficiency: Run baseline workloads on bare-metal servers and scale using public clouds when extra capacity is needed.&lt;/li&gt;
&lt;li&gt;Flexibility: Choose the best cloud provider for specific workloads based on cost or performance.&lt;/li&gt;
&lt;li&gt;Unified Management: Manage all workloads from a single Kubernetes cluster.&lt;/li&gt;
&lt;li&gt;Vendor Independence: Avoid being locked into a single cloud provider.&lt;/li&gt;
&lt;li&gt;Database Solutions: Run database operators directly on Kubernetes, similar to how managed databases work in the cloud.&lt;/li&gt;
&lt;li&gt;Disaster Recovery: Keep standby nodes in the cloud for quick failover during outages.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Challenges of Running Kubernetes on Hybrid Cloud
&lt;/h2&gt;

&lt;p&gt;While the benefits are significant, running Kubernetes on a hybrid cloud is not without challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud Integration: Native integration with cloud services is essential for smooth operation.&lt;/li&gt;
&lt;li&gt;Networking: Secure and reliable connections between on-premises and cloud environments are crucial.&lt;/li&gt;
&lt;li&gt;Latency: Minimize latency by optimizing traffic routing.&lt;/li&gt;
&lt;li&gt;Security: Implement robust measures to secure workloads and data across environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Existing Solutions for Hybrid Kubernetes
&lt;/h2&gt;

&lt;p&gt;Kubernetes already offers several tools and features for managing hybrid cloud clusters effectively:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://talos.dev/" rel="noopener noreferrer"&gt;Talos&lt;/a&gt;: A secure, immutable, and minimal operating system designed for Kubernetes.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.io/docs/concepts/architecture/cloud-controller/" rel="noopener noreferrer"&gt;Cloud Controller Manager&lt;/a&gt;: Helps manage Kubernetes nodes in cloud environments.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.io/docs/concepts/storage/persistent-volumes/" rel="noopener noreferrer"&gt;Persistent Volumes&lt;/a&gt;: Support for multiple storage types, enabling on-premises and cloud storage usage within the same cluster.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/sergelogvinov/kubernetes-pods-with-global-ipv6-1aaj"&gt;IPv6 for Pods and Services&lt;/a&gt;: Allows seamless pod-to-pod communication without the need for NAT.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.io/docs/concepts/services-networking/service-traffic-policy/" rel="noopener noreferrer"&gt;Internal Traffic Policy&lt;/a&gt;: Controls how traffic is routed within the cluster.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.io/docs/concepts/services-networking/topology-aware-routing/" rel="noopener noreferrer"&gt;Topology Aware Routing&lt;/a&gt;: Optimizes network traffic routing based on the physical location of nodes.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution" rel="noopener noreferrer"&gt;Traffic distribution&lt;/a&gt;: Distributes traffic across multiple zones.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Networking Setup
&lt;/h2&gt;

&lt;p&gt;Networking is one of the most important aspects of a hybrid Kubernetes cluster. You must ensure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A secure VPN or direct connection between on-premises and cloud networks.&lt;/li&gt;
&lt;li&gt;Consistent IP addressing and DNS resolution.&lt;/li&gt;
&lt;li&gt;Proper network policies for traffic management.&lt;/li&gt;
&lt;li&gt;Load balancing across on-premises and cloud environments.&lt;/li&gt;
&lt;/ul&gt;
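&lt;p&gt;As one illustration, a site-to-site tunnel between the on-premises and cloud networks can be sketched with WireGuard. The addresses, keys, hostnames, and interface names below are placeholders, not values from a real setup:&lt;/p&gt;

```shell
# /etc/wireguard/wg0.conf on the on-premises gateway (example values only):
#
# [Interface]
# Address    = 10.100.0.1/24            # overlay address of this gateway
# ListenPort = 51820
# PrivateKey = ON_PREM_PRIVATE_KEY
#
# [Peer]                                # the cloud-side gateway
# PublicKey  = CLOUD_PUBLIC_KEY
# AllowedIPs = 10.100.0.2/32, 172.16.0.0/16   # cloud node/pod CIDRs
# Endpoint   = cloud.example.com:51820
# PersistentKeepalive = 25

wg-quick up wg0        # bring the tunnel up
wg show wg0            # verify handshakes and transfer counters
```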

&lt;p&gt;Network architecture is very important for a hybrid cloud Kubernetes cluster, and it is very hard to change after the cluster is up and running.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cluster Configuration
&lt;/h2&gt;

&lt;p&gt;When setting up your cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define which workloads run on-premises and which in the cloud.&lt;/li&gt;
&lt;li&gt;Use node selector/affinity to control where specific pods run.&lt;/li&gt;
&lt;/ul&gt;
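&lt;p&gt;For example, workloads can be pinned to on-premises nodes with a label and a node selector. The node name, label value, and deployment name here are hypothetical:&lt;/p&gt;

```shell
# Label the on-premises nodes (hypothetical node and zone names)
kubectl label node worker-onprem-1 topology.kubernetes.io/zone=onprem

# Pin a deployment to those nodes via nodeSelector
kubectl patch deployment web -p \
  '{"spec":{"template":{"spec":{"nodeSelector":{"topology.kubernetes.io/zone":"onprem"}}}}}'
```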

&lt;h2&gt;
  
  
  Control plane configuration
&lt;/h2&gt;

&lt;p&gt;Kubernetes uses etcd as its database to store cluster state. It is crucial to ensure etcd remains highly available and secure. Here are some best practices:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use an odd number of etcd nodes: This ensures high availability and avoids split-brain scenarios.&lt;/li&gt;
&lt;li&gt;Run across multiple availability zones or cloud providers: select locations and providers with minimal latency between etcd nodes.&lt;/li&gt;

&lt;li&gt;Regularly back up etcd data: Prevent data loss with automated backups.&lt;/li&gt;
&lt;li&gt;Disaster Recovery Plans: Test and train disaster recovery plans for etcd failures.&lt;/li&gt;
&lt;/ul&gt;
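&lt;p&gt;A minimal backup sketch, assuming &lt;code&gt;etcdctl&lt;/code&gt; can reach the cluster endpoints; the certificate and output paths are placeholders:&lt;/p&gt;

```shell
# Take an etcd snapshot (endpoint and certificate paths are placeholders)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/client.crt --key=/etc/etcd/client.key \
  snapshot save /var/backups/etcd-$(date +%F).db

# On Talos the equivalent is:
#   talosctl etcd snapshot /var/backups/etcd-$(date +%F).db
```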

&lt;h2&gt;
  
  
  Cloud integration
&lt;/h2&gt;

&lt;p&gt;Most cloud providers offer add-ons that integrate with Kubernetes. These add-ons were designed to run on the cloud provider's managed Kubernetes service; however, with some modifications you can run them on your hybrid cluster as well.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CCM (Cloud Controller Manager) requires code changes; check my &lt;a href="https://github.com/sergelogvinov/containers" rel="noopener noreferrer"&gt;already modified version&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;CSI (Container Storage Interface) is a standard for exposing storage systems to containerized workloads on Kubernetes. Most CSI drivers work well in a hybrid cloud setup.&lt;/li&gt;
&lt;li&gt;Cluster Node Autoscaler also requires &lt;a href="https://github.com/sergelogvinov/containers/tree/main/cluster-autoscaler" rel="noopener noreferrer"&gt;some changes&lt;/a&gt;
&lt;/li&gt;

&lt;h2&gt;
  
  
  Monitor and Optimize
&lt;/h2&gt;

&lt;p&gt;Monitoring tools like Prometheus and Grafana can help track the health and performance of your hybrid cluster. Additionally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use autoscalers to manage resource usage efficiently.&lt;/li&gt;
&lt;li&gt;Set alerts for potential issues.&lt;/li&gt;
&lt;li&gt;Regularly review and optimize your cluster setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Running a Kubernetes cluster in a hybrid environment offers flexibility and scalability, but it requires careful planning and execution.&lt;br&gt;
To find out more about Kubernetes on hybrid cloud, check out the following resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/sergelogvinov/terraform-talos" rel="noopener noreferrer"&gt;Talos on Hybrid Cloud&lt;/a&gt; to deploy Talos on different cloud providers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sergelogvinov/gitops-examples" rel="noopener noreferrer"&gt;Fluxcd&lt;/a&gt; to install base addons in hybrid cluster&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sergelogvinov/containers" rel="noopener noreferrer"&gt;Container images&lt;/a&gt; with modified cloud addons&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/sergelogvinov/helm-charts" rel="noopener noreferrer"&gt;Helm charts&lt;/a&gt; for deploying applications in hybrid cluster&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>cloud</category>
      <category>linux</category>
      <category>hybridcloud</category>
    </item>
    <item>
      <title>Proxmox Virtual Machine optimization</title>
      <dc:creator>Serge Logvinov</dc:creator>
      <pubDate>Mon, 23 Dec 2024 05:26:50 +0000</pubDate>
      <link>https://dev.to/sergelogvinov/proxmox-virtual-machine-optimization-7mn</link>
      <guid>https://dev.to/sergelogvinov/proxmox-virtual-machine-optimization-7mn</guid>
      <description>&lt;p&gt;Proxmox Virtual Environment (VE) is a powerful open-source virtualization platform used to manage virtual machines (VMs). To make your VMs run faster, you need to set them up correctly.&lt;/p&gt;

&lt;h1&gt;
  
  
  Common optimizations
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;CPU settings&lt;/li&gt;
&lt;li&gt;Memory settings&lt;/li&gt;
&lt;li&gt;Network configuration&lt;/li&gt;
&lt;li&gt;Disk storage type&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  CPU Settings
&lt;/h1&gt;

&lt;h3&gt;
  
  
  CPU Type
&lt;/h3&gt;

&lt;p&gt;If you do not want to use live migration, you need to set the CPU type to &lt;code&gt;host&lt;/code&gt;. This setting allows the VM to use all features of the physical CPU but disables live migration entirely, which might not be suitable for all environments.&lt;/p&gt;
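&lt;p&gt;For example, on the Proxmox host (the VM ID 100 is a placeholder):&lt;/p&gt;

```shell
# Expose the full host CPU feature set to the VM; this disables live
# migration between hosts with different CPUs. 100 is an example VM ID.
qm set 100 --cpu host
```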

&lt;h3&gt;
  
  
  CPU Affinity
&lt;/h3&gt;

&lt;p&gt;Dedicate CPUs to the VM by setting a CPU affinity list. This prevents the VM from being scheduled on other CPUs.&lt;/p&gt;

&lt;p&gt;Turn on NUMA (Non-Uniform Memory Access) for better memory handling on large servers. Choose the right CPU cores in one NUMA node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/sergelogvinov/proxmox-cpu-affinity-for-vms-4dhb"&gt;More details&lt;/a&gt;&lt;/p&gt;
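&lt;p&gt;Both settings can be applied from the CLI; the VM ID and core list below are examples:&lt;/p&gt;

```shell
# Pin VM 100 to physical cores 0-7 and enable NUMA awareness.
# Pick cores that belong to a single NUMA node; check the layout with: lscpu -e
qm set 100 --affinity 0-7 --numa 1
```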

&lt;h1&gt;
  
  
  Memory Settings
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Memory Huge Pages
&lt;/h3&gt;

&lt;p&gt;Consider using huge pages for better memory performance. Huge pages are larger than normal pages and can improve memory access speed. The memory is allocated in 2 MB or 1 GB chunks during the boot process, and it is reserved exclusively for the VM; the hypervisor cannot use it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/sergelogvinov/proxmox-hugepages-for-vms-1fh3"&gt;More details&lt;/a&gt;&lt;/p&gt;
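&lt;p&gt;A quick way to size the reservation is to divide the VM memory by the page size. The 16 GB figure below is just an example:&lt;/p&gt;

```shell
# How many hugepages does a 16 GB VM need? (example size)
MEM_MB=16384
PAGES_2M=$((MEM_MB / 2))      # number of 2 MB pages
PAGES_1G=$((MEM_MB / 1024))   # number of 1 GB pages
echo "2MB pages: ${PAGES_2M}, 1GB pages: ${PAGES_1G}"

# Reserve them on the host at boot (kernel cmdline), then enable per VM:
#   default_hugepagesz=1G hugepagesz=1G hugepages=16
#   qm set 100 --hugepages 1024
```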

&lt;h1&gt;
  
  
  Network Configuration
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Use VirtIO Network Driver
&lt;/h3&gt;

&lt;p&gt;Always use VirtIO drivers for network cards. VirtIO drivers are paravirtualized drivers for network and disk devices, and they are faster than the emulated defaults. Enable Multiqueue and set the number of queues equal to the number of CPU cores.&lt;/p&gt;
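&lt;p&gt;For instance, for a VM with 8 vCPUs (the VM ID and bridge name are examples):&lt;/p&gt;

```shell
# VirtIO NIC with 8 queues to match 8 vCPUs; vmbr0 and VM ID 100 are examples
qm set 100 --net0 virtio,bridge=vmbr0,queues=8
```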

&lt;h3&gt;
  
  
  Enable Jumbo Frames
&lt;/h3&gt;

&lt;p&gt;Allow Jumbo Frames for larger network packets. You can enable Jumbo Frames in Proxmox by setting the MTU (Maximum Transmission Unit) value to 9000 on the network interface configuration.&lt;/p&gt;
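&lt;p&gt;A sketch of both sides, assuming a bridge named &lt;code&gt;vmbr0&lt;/code&gt; and VM ID 100; note that every device on the path, including physical switches, must allow MTU 9000:&lt;/p&gt;

```shell
# Raise the MTU on the host bridge (persist it in /etc/network/interfaces)
ip link set vmbr0 mtu 9000

# Set the MTU on the VM's VirtIO NIC as well
qm set 100 --net0 virtio,bridge=vmbr0,mtu=9000
```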

&lt;h3&gt;
  
  
  SR-IOV
&lt;/h3&gt;

&lt;p&gt;Use Single Root I/O Virtualization (SR-IOV) for better network performance. SR-IOV allows a single physical network card to appear as multiple virtual network cards. After enabling SR-IOV, you can assign a virtual function to the VM, giving it direct access to the physical network card; the hypervisor is not involved in network processing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/sergelogvinov/network-performance-optimization-with-nvidia-connectx-on-proxmox-5f7j"&gt;More details&lt;/a&gt;&lt;/p&gt;
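&lt;p&gt;A rough sketch of the steps on the host; the interface name, PCI address, and VM ID are placeholders:&lt;/p&gt;

```shell
# Create 4 virtual functions on the physical NIC (interface name is an example)
echo 4 | tee /sys/class/net/enp1s0f0/device/sriov_numvfs

# Find the VF PCI addresses, then pass one VF to the VM
lspci | grep -i virtual
qm set 100 --hostpci0 0000:01:00.1,pcie=1   # pcie=1 requires the q35 machine type
```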

&lt;h1&gt;
  
  
  Disk Storage Type
&lt;/h1&gt;

&lt;p&gt;The disk storage type is one of the most important factors for VM performance.&lt;br&gt;
Local storage is faster than network storage. SSD is faster than HDD.&lt;br&gt;
LVM (not LVM-thin) is faster than ZFS. ZFS is faster than NFS.&lt;/p&gt;

&lt;p&gt;For the safety of your data, use software RAID 1 or 10 for local storage.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;for 2 disks, use RAID 1 with the mdraid implementation&lt;/li&gt;
&lt;li&gt;for 4 disks or more, consider ZFS&lt;/li&gt;
&lt;/ul&gt;
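&lt;p&gt;A two-disk mirror with mdraid can be sketched like this; the partition names are placeholders, and the command destroys any existing data on them:&lt;/p&gt;

```shell
# Build a RAID 1 mirror from two partitions (placeholder devices; wipes their data)
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/nvme0n1p3 /dev/nvme1n1p3

cat /proc/mdstat                                       # watch the initial sync
mdadm --detail --scan | tee -a /etc/mdadm/mdadm.conf   # persist the array config
```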

&lt;p&gt;The LVM backend is faster than other backends for local storage.&lt;br&gt;
The ZFS backend works better with large storage, but it requires more memory and CPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use VirtIO/SCSI VirtIO drivers
&lt;/h3&gt;

&lt;p&gt;Always use VirtIO drivers for disk devices. VirtIO drivers are paravirtualized disk drivers and are faster than the emulated defaults.&lt;/p&gt;
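&lt;p&gt;In Proxmox terms (the VM ID, storage, and disk names are examples):&lt;/p&gt;

```shell
# VirtIO SCSI single controller with a dedicated I/O thread per disk
qm set 100 --scsihw virtio-scsi-single
qm set 100 --scsi0 local-lvm:vm-100-disk-0,iothread=1,discard=on
```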

&lt;h3&gt;
  
  
  SR-IOV
&lt;/h3&gt;

&lt;p&gt;For maximum storage performance, you can give the VM direct access to an NVMe device, either by passing the whole controller through with PCI passthrough or, on enterprise NVMe drives that support it, via SR-IOV. The VM then accesses the physical disk directly, with performance close to that of the host machine.&lt;/p&gt;
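&lt;p&gt;A passthrough sketch; the PCI address and VM ID are placeholders, and IOMMU must be enabled in the BIOS and on the kernel command line:&lt;/p&gt;

```shell
# Find the NVMe controller's PCI address, then hand it to the VM
lspci -nn | grep -i nvme
qm set 100 --hostpci0 0000:03:00.0,pcie=1   # pcie=1 requires the q35 machine type
```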

</description>
      <category>proxmox</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
