<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Daniele Polencic</title>
    <description>The latest articles on DEV Community by Daniele Polencic (@danielepolencic).</description>
    <link>https://dev.to/danielepolencic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F84473%2F1a8a7aa5-c250-402e-9b75-45c61bab002f.png</url>
      <title>DEV Community: Daniele Polencic</title>
      <link>https://dev.to/danielepolencic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/danielepolencic"/>
    <language>en</language>
    <item>
      <title>Sticky sessions and canary releases in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Mon, 19 Jun 2023 12:17:23 +0000</pubDate>
      <link>https://dev.to/danielepolencic/sticky-sessions-and-canary-releases-in-kubernetes-5a92</link>
      <guid>https://dev.to/danielepolencic/sticky-sessions-and-canary-releases-in-kubernetes-5a92</guid>
      <description>&lt;p&gt;Sticky sessions, or session affinity, is a convenient strategy to ensure that subsequent requests from the same client always reach the same pod.&lt;/p&gt;

&lt;p&gt;Let's look at how it works by deploying a sample application with three replicas and one service.&lt;/p&gt;

&lt;p&gt;In this scenario, &lt;strong&gt;requests directed to the service are load-balanced amongst the available replicas.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfua7cbutm2ohh761b4h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfua7cbutm2ohh761b4h.png" alt="A deployment and a service in Kubernetes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's deploy the &lt;a href="https://github.com/kubernetes/ingress-nginx" rel="noopener noreferrer"&gt;ingress-nginx controller&lt;/a&gt; and create an Ingress manifest for the deployment.&lt;/p&gt;

&lt;p&gt;In this case, &lt;strong&gt;the ingress controller skips the service&lt;/strong&gt; and load balances the traffic directly to the pods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbt05k7sy0d62z82cyrt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbt05k7sy0d62z82cyrt.png" alt="A deployment, a service and ingress controller in Kubernetes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the two scenarios end up with the same outcome (i.e. requests are distributed to all replicas), there's a subtle (but essential) distinction: &lt;strong&gt;the Service operates on L4 (TCP/UDP), whereas the Ingress is L7 (HTTP).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fierp2sq8px249to2g9wk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fierp2sq8px249to2g9wk.png" alt="Difference between Kubernetes service and ingress controller" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unlike the service, the Ingress controller can route traffic based on paths, headers, etc.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can also use it to define weights (e.g. 20-80 traffic split) or sticky sessions (all requests from the same origin always land on the same pod).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rk64ou6k3z78rx0pbbw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rk64ou6k3z78rx0pbbw.png" alt="Sticky sessions, weighted traffic and canary release in an Ingress controller" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following Ingress implements &lt;strong&gt;sticky sessions for the ingress-nginx controller.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ingress writes a cookie to your browser to keep track of which pod served you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdfxbohspc5ny18pfhch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdfxbohspc5ny18pfhch.png" alt="Session affinity in ingress-nginx" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are two convenient settings for affinity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;balanced&lt;/strong&gt; — requests are redistributed if the deployment scales up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;persistent&lt;/strong&gt; — no matter what, the requests always stick to the same pod.&lt;/li&gt;
&lt;/ol&gt;
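&lt;p&gt;The mode is controlled with an annotation; a minimal sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;metadata:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    # "balanced" is the default; "persistent" always sticks to the same pod
    nginx.ingress.kubernetes.io/affinity-mode: "persistent"
&lt;/code&gt;&lt;/pre&gt;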

&lt;p&gt;&lt;strong&gt;ingress-nginx can also be used for canary releases.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you have two deployments and you wish to route a subset of the traffic to the newer version, you can do so with a canary release (and impact a minimal number of users).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnzkr7ah9fol73d61k8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnzkr7ah9fol73d61k8n.png" alt="A canary release with ingress-nginx" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In a canary release, each deployment has its own Ingress manifest.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;However, one of those is labelled as a canary.&lt;/p&gt;

&lt;p&gt;You can decide how the traffic is forwarded: for example, you could inspect a header or cookie.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F224e08ozpaaygjx4ld47.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F224e08ozpaaygjx4ld47.png" alt="Canary release beased on a header value" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, all requests carrying the header value east-us are routed to the canary deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpohg5iho8qzoo0w10ej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpohg5iho8qzoo0w10ej.png" alt="Example of routing traffic with a canary release" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also decide which fraction of the total traffic is routed to the canary with weights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgh691pgjtp55w3bmpghy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgh691pgjtp55w3bmpghy.png" alt="Setting weights to a canary release" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But if the header is omitted in a subsequent request, the user falls back to the previous deployment.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How can you fix that?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fviwq5lnfacrqw6bxxxt9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fviwq5lnfacrqw6bxxxt9.png" alt="Traffic in canary releases is not sticky" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;With sticky sessions!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can combine canary releases and sticky sessions with ingress-nginx to progressively (and safely) roll out new deployments to your users.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyk26b5wn414oymtkmwmr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyk26b5wn414oymtkmwmr.png" alt="Combining canary releases and sticky sessions" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's important to remember that these types of canary releases are only possible for apps whose traffic flows through the ingress controller.&lt;/p&gt;

&lt;p&gt;To roll out a canary release for internal microservices, you should look at alternatives (e.g. service mesh).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppyeywyqv73gzwihvkao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppyeywyqv73gzwihvkao.png" alt="Traffic to internal services is not forwarded by the ingress controller" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Is ingress-nginx the only option for sticky sessions and canary releases?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not really, but the annotations differ between ingress controllers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.google.com/spreadsheets/d/191WWNpjJ2za6-nbG4ZoUMXMpUK8KlCIosvQB0f-oq3k" rel="noopener noreferrer"&gt;At Learnk8s, we've put together a spreadsheet to compare them.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;The Kubernetes workshops that we run at Learnk8s.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;This collection of past threads.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;The Kubernetes newsletter I publish every week.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While authoring this post, I also found the following resources valuable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.mirantis.com/mke/3.6/ops/deploy-apps-k8s/nginx-ingress/configure-canary-deployment.html" rel="noopener noreferrer"&gt;Configure a canary deployment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://help.ovhcloud.com/csm/en-sg-public-cloud-kubernetes-sticky-session-nginx-ingress?id=kb_article_view&amp;amp;sysparm_article=KB0049968" rel="noopener noreferrer"&gt;Sticky sessions/Session Affinity based on Nginx Ingress&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pauldally.medium.com/session-affinity-and-kubernetes-proceed-with-caution-8e66fd5deb05" rel="noopener noreferrer"&gt;Session Affinity and Kubernetes— Proceed With Caution!&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes/ingress-nginx/blob/main/docs/user-guide/nginx-configuration/annotations.md#session-affinity" rel="noopener noreferrer"&gt;Session Affinity&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>What happens when you create a Pod in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Tue, 30 May 2023 17:54:17 +0000</pubDate>
      <link>https://dev.to/danielepolencic/what-happens-when-you-create-a-pod-in-kubernetes-58io</link>
      <guid>https://dev.to/danielepolencic/what-happens-when-you-create-a-pod-in-kubernetes-58io</guid>
      <description>&lt;p&gt;What happens when you create a Pod in Kubernetes?&lt;/p&gt;

&lt;p&gt;A surprisingly simple task reveals a complicated workflow that touches several components in the cluster.&lt;/p&gt;

&lt;p&gt;Let's start with the obvious: &lt;strong&gt;kubectl sends the YAML definition to the API server.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this step, kubectl:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discovers the API endpoints using OpenAPI (Swagger).&lt;/li&gt;
&lt;li&gt;Negotiates the preferred API version.&lt;/li&gt;
&lt;li&gt;Validates the YAML.&lt;/li&gt;
&lt;li&gt;Issues the request.&lt;/li&gt;
&lt;/ul&gt;
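&lt;p&gt;For reference, the YAML that kubectl validates and submits could be as simple as this minimal Pod (names and image are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: app
      image: nginx:1.25
      ports:
        - containerPort: 80
&lt;/code&gt;&lt;/pre&gt;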

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeina9t7otyis95rj12u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeina9t7otyis95rj12u.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the request reaches the API, it goes through the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Authentication &amp;amp; authorization.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Admission controllers.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the last step, it's finally stored in etcd.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vkpvl8hoafx9zokph1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vkpvl8hoafx9zokph1r.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, the pod is added to the scheduler queue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The scheduler filters and scores the nodes to find the best one.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And it finally binds the pod to the node.&lt;/p&gt;

&lt;p&gt;The binding is written in etcd.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdrzrp8on28glku8wymj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdrzrp8on28glku8wymj.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, the pod exists only in etcd as a record.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The infrastructure hasn't created any containers yet.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's where the kubelet takes over.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv77ly7mgmlqlz8aj8kcf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv77ly7mgmlqlz8aj8kcf.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The kubelet pulls the Pod definition and proceeds to delegate:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Network creation to the CNI (e.g. Cilium).&lt;/li&gt;
&lt;li&gt;Container creation to the CRI (e.g. containerd).&lt;/li&gt;
&lt;li&gt;Storage creation to the CSI (e.g. OpenEBS).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljn6kiwk9m2zk9dxuysi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljn6kiwk9m2zk9dxuysi.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Among other things, the Kubelet will execute the Pod's probes and, when the Pod is running, report its IP address to the control plane.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That IP and the containers' ports are stored as endpoints in etcd.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpxk8vqp6obv50qrj7yy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpxk8vqp6obv50qrj7yy.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Wait… endpoint what?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an endpoint is an IP:port pair (e.g. 10.0.0.2:3000).&lt;/li&gt;
&lt;li&gt;an Endpoints object is a collection of endpoints (a list of IP:port pairs).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For every Service in the cluster, &lt;strong&gt;Kubernetes creates an Endpoints object with a list of endpoints.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Confusing, isn't it?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7itlzs7p4161mf6q0j2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7itlzs7p4161mf6q0j2.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The endpoints (IP:port) are used by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kube-proxy to set iptables rules.&lt;/li&gt;
&lt;li&gt;CoreDNS to update the DNS entries.&lt;/li&gt;
&lt;li&gt;Ingress controllers to configure their backends (upstreams).&lt;/li&gt;
&lt;li&gt;Service meshes.&lt;/li&gt;
&lt;li&gt;And more operators.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As soon as an endpoint is added, the components are notified.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk294zd0hg8rq0hf653d0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk294zd0hg8rq0hf653d0.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the endpoint (IP:port) is propagated, you can finally start using the Pod!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What happens when you delete a Pod?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The same process, but in reverse.&lt;/p&gt;

&lt;p&gt;This is trickier than it sounds because there are several opportunities for race conditions.&lt;/p&gt;

&lt;p&gt;The correct sequence is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The app stops accepting new connections.&lt;/li&gt;
&lt;li&gt;The controllers (kube-proxy, ingress, etc.) remove the endpoint.&lt;/li&gt;
&lt;li&gt;The app drains the existing connections.&lt;/li&gt;
&lt;li&gt;The app shuts down.&lt;/li&gt;
&lt;/ol&gt;
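&lt;p&gt;A common way to give the controllers time to remove the endpoint is a preStop hook that delays the shutdown (a sketch; the image name and durations are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;spec:
  terminationGracePeriodSeconds: 45
  containers:
    - name: app
      image: my-app:1.0.0
      lifecycle:
        preStop:
          exec:
            # wait for the endpoint removal to propagate before SIGTERM is sent
            command: ["sleep", "15"]
&lt;/code&gt;&lt;/pre&gt;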

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuouemh9mzvxfekbihtwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuouemh9mzvxfekbihtwq.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to learn more about the graceful shutdown in Kubernetes, you can find my article here &lt;a href="https://learnk8s.io/graceful-shutdown" rel="noopener noreferrer"&gt;https://learnk8s.io/graceful-shutdown&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like the Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;https://learnk8s.io/training&lt;/a&gt; or this collection of past Twitter threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes scheduler deep dive</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Tue, 23 May 2023 12:44:34 +0000</pubDate>
      <link>https://dev.to/danielepolencic/kubernetes-scheduler-deep-dive-3phj</link>
      <guid>https://dev.to/danielepolencic/kubernetes-scheduler-deep-dive-3phj</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A newer, expanded version of this article is available at &lt;a href="https://learnkube.com/kubernetes-scheduler-explained" rel="noopener noreferrer"&gt;learnkube.com/kubernetes-scheduler-explained&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The scheduler is in charge of deciding where your pods are deployed in the cluster.&lt;/p&gt;

&lt;p&gt;It might sound like an easy job, but it's rather complicated!&lt;/p&gt;

&lt;p&gt;Let's start with the basics.&lt;/p&gt;

&lt;p&gt;When you submit a deployment with kubectl, the API server receives the request, and the resource is stored in etcd.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Who creates the pods?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0f611e8cbs4vtnuzaoi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0f611e8cbs4vtnuzaoi.png" alt="A pod resource is stored in Etcd" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's a common misconception that it's the scheduler's job to create the pods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instead, the controller manager creates them (and the associated ReplicaSet).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftessmrsxfalie5gc92nc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftessmrsxfalie5gc92nc.png" alt="The controller manager creates the pods" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, the pods are stored as "Pending" in etcd and are not assigned to any node.&lt;/p&gt;

&lt;p&gt;They are also added to the scheduler's queue, ready to be assigned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ycupq8x3z66cdchn5i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ycupq8x3z66cdchn5i4.png" alt="Pods are added to the scheduler queue." width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scheduler processes Pods one by one through two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scheduling phase (what node should I choose?).&lt;/li&gt;
&lt;li&gt;Binding phase (let's write to the database that this pod belongs to that node).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1fuvn52rj03vhfbnygw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1fuvn52rj03vhfbnygw.png" alt="Pods are allocated one at the time" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scheduling phase is divided into two parts. The scheduler:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Filters relevant nodes (using a list of functions called predicates)&lt;/li&gt;
&lt;li&gt;Ranks the remaining nodes (using a list of functions called priorities)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's have a look at an example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2vnfepah7ke36y93dmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2vnfepah7ke36y93dmp.png" alt="The scheduler filters and scores nodes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider the following cluster, with a mix of nodes with and without a GPU.&lt;/p&gt;

&lt;p&gt;Also, a few nodes are already running at full capacity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm5nf99i84mkimlctq2v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm5nf99i84mkimlctq2v.png" alt="A collection of Kubernetes nodes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You want to deploy a Pod that requires a GPU.&lt;/p&gt;

&lt;p&gt;You submit the pod to the cluster, and it's added to the scheduler queue.&lt;/p&gt;

&lt;p&gt;The scheduler discards all nodes that don't have GPU (filter phase).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqk3guozu8hp42y1k7y0e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqk3guozu8hp42y1k7y0e.png" alt="All non-GPU nodes are discarded" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, the scheduler scores the remaining nodes.&lt;/p&gt;

&lt;p&gt;In this example, the fully utilized nodes are scored lower.&lt;/p&gt;

&lt;p&gt;In the end, the empty node is selected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsibu12nobxigaysubftv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsibu12nobxigaysubftv.png" alt="The remaining nodes are scored" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What are some examples of filters?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;NodeUnschedulable&lt;/code&gt; prevents pods from landing on nodes marked as unschedulable.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;VolumeBinding&lt;/code&gt; checks if the node can bind the requested volume.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The default filtering phase has 13 predicates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbovjk4b5pnrr4ytqth5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbovjk4b5pnrr4ytqth5.png" alt="Default predicates in the Kubernetes scheduler" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some examples of scoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ImageLocality&lt;/code&gt; prefers nodes that already have the container image downloaded locally.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;NodeResourcesBalancedAllocation&lt;/code&gt; prefers underutilized nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are 13 functions to decide how to score and rank nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9ltqojgnwihrinj90ow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9ltqojgnwihrinj90ow.png" alt="Default functions to score nodes in Kubernetes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How can you influence the scheduler's decisions?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;nodeSelector&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Node affinity&lt;/li&gt;
&lt;li&gt;Pod affinity/anti-affinity&lt;/li&gt;
&lt;li&gt;Taints and tolerations&lt;/li&gt;
&lt;li&gt;Topology constraints&lt;/li&gt;
&lt;li&gt;Scheduler profiles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;nodeSelector&lt;/code&gt; is the most straightforward mechanism.&lt;/p&gt;

&lt;p&gt;You assign a label to a node and add that label to the pod.&lt;/p&gt;

&lt;p&gt;The pod can only be deployed on nodes with that label.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foc1cl8cr740ycryt2rpa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foc1cl8cr740ycryt2rpa.png" alt="Assigning pods to nodes with the nodeSelector" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
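&lt;p&gt;As a minimal sketch (the &lt;code&gt;disktype=ssd&lt;/code&gt; label is illustrative), you label a node with &lt;code&gt;kubectl label nodes node-1 disktype=ssd&lt;/code&gt; and then reference that label in the pod:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  nodeSelector:
    disktype: ssd          # only nodes carrying this label are eligible
  containers:
    - name: app
      image: nginx
&lt;/code&gt;&lt;/pre&gt;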

&lt;p&gt;Node affinity extends nodeSelector with a more flexible interface.&lt;/p&gt;

&lt;p&gt;You can still tell the scheduler where the Pod should be deployed, but you can also have soft and hard constraints.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquerpdi9bclj960dq9qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquerpdi9bclj960dq9qr.png" alt="Assigning pods to nodes with node affinity" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
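&lt;p&gt;For example, you can combine a hard constraint (&lt;code&gt;requiredDuringSchedulingIgnoredDuringExecution&lt;/code&gt;) with a soft one (&lt;code&gt;preferredDuringSchedulingIgnoredDuringExecution&lt;/code&gt;); the labels below are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  affinity:
    nodeAffinity:
      # Hard constraint: the pod can only land on nodes matching this.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values: ["ssd"]
      # Soft constraint: the scheduler prefers, but doesn't require, a match.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["eu-west-1a"]
  containers:
    - name: app
      image: nginx
&lt;/code&gt;&lt;/pre&gt;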

&lt;p&gt;With Pod affinity/anti-affinity, you can ask the scheduler to place a pod next to a specific pod.&lt;/p&gt;

&lt;p&gt;Or not.&lt;/p&gt;

&lt;p&gt;For example, you could have a deployment with anti-affinity on itself to force spreading pods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmat547ed1drxons9v40z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmat547ed1drxons9v40z.png" alt="Scheduling pods with pod affinity and anti-affinity" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;
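&lt;p&gt;A sketch of such a deployment, with anti-affinity on its own &lt;code&gt;app&lt;/code&gt; label so that no two replicas share a node:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: app
              topologyKey: kubernetes.io/hostname   # at most one replica per node
      containers:
        - name: app
          image: nginx
&lt;/code&gt;&lt;/pre&gt;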

&lt;p&gt;With taints and tolerations, nodes are tainted, and pods that don't tolerate the taint are repelled.&lt;/p&gt;

&lt;p&gt;This is similar to node affinity, but there's a notable difference: with Node affinity, Pods are attracted to nodes.&lt;/p&gt;

&lt;p&gt;Taints are the opposite - they allow a node to repel pods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuw0msjhdmj9fxomj0c5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuw0msjhdmj9fxomj0c5.png" alt="Scheduling pods with taints and tolerations" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
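&lt;p&gt;As a sketch (the &lt;code&gt;dedicated=gpu&lt;/code&gt; taint and image name are made up), you taint a node with &lt;code&gt;kubectl taint nodes node-1 dedicated=gpu:NoSchedule&lt;/code&gt;, and only pods carrying a matching toleration can land on it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule   # this pod tolerates the taint above
  containers:
    - name: job
      image: my-gpu-image   # illustrative image name
&lt;/code&gt;&lt;/pre&gt;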

&lt;p&gt;Moreover, taints can repel pods with three effects: &lt;code&gt;NoExecute&lt;/code&gt; (evict), &lt;code&gt;NoSchedule&lt;/code&gt;, and &lt;code&gt;PreferNoSchedule&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Personal note: this is one of the most difficult APIs I worked with.&lt;/p&gt;

&lt;p&gt;I always (and consistently) get it wrong as it's hard (for me) to reason in double negatives.&lt;/p&gt;

&lt;p&gt;You can use topology spread constraints to control how Pods are spread across your cluster.&lt;/p&gt;

&lt;p&gt;This is convenient when you want to ensure that all pods aren't landing on the same node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dhvq28uygtkajy2pskc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dhvq28uygtkajy2pskc.png" alt="Pod topology constraints" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
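&lt;p&gt;A minimal sketch of a spread constraint that keeps replicas evenly distributed across nodes (labels are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: app
  labels:
    app: app
spec:
  topologySpreadConstraints:
    - maxSkew: 1                         # at most 1 pod of difference between nodes
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule   # hard; use ScheduleAnyway for a soft constraint
      labelSelector:
        matchLabels:
          app: app
  containers:
    - name: app
      image: nginx
&lt;/code&gt;&lt;/pre&gt;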

&lt;p&gt;And finally, you can use scheduler profiles to customize how the scheduler combines filters and scoring plugins when assigning nodes to pods.&lt;/p&gt;

&lt;p&gt;Profiles replaced the legacy scheduler policies and allow you to turn off built-in plugins or add new logic to the scheduler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4k0yt5iatymigsk1r9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4k0yt5iatymigsk1r9w.png" alt="Scheduler policies in Kubernetes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
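&lt;p&gt;For example, a profile that disables one of the default scoring plugins might look like this (the profile name is made up); pods opt in by setting a matching &lt;code&gt;schedulerName&lt;/code&gt; in their spec:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: no-image-locality
    plugins:
      score:
        disabled:
          - name: ImageLocality   # skip the image-locality score for these pods
&lt;/code&gt;&lt;/pre&gt;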

&lt;p&gt;You can learn more about the scheduler here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes scheduler &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Scheduling framework &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Scheduler configuration &lt;a href="https://kubernetes.io/docs/reference/scheduling/config/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/reference/scheduling/config/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnkube.com/training" rel="noopener noreferrer"&gt;https://learnkube.com/training&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This collection of past threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Kubernetes newsletter I publish every week "Learn Kubernetes weekly" &lt;a href="https://learnkube.com/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;https://learnkube.com/learn-kubernetes-weekly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Traffic shaping with Istio and Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Mon, 15 May 2023 12:51:41 +0000</pubDate>
      <link>https://dev.to/danielepolencic/traffic-shaping-with-istio-and-kubernetes-4pcf</link>
      <guid>https://dev.to/danielepolencic/traffic-shaping-with-istio-and-kubernetes-4pcf</guid>
      <description>&lt;p&gt;You can roll out an app only to a subset of your users in Kubernetes using canary releases with Istio, Kiali and the Gateway API.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let's start by looking at an example.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The current cluster has three apps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A backend that exposes an API at version v1.&lt;/li&gt;
&lt;li&gt;Another app on version v2.&lt;/li&gt;
&lt;li&gt;A frontend component that consumes the API.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ugybbov6jbaf8cgckeq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ugybbov6jbaf8cgckeq.png" alt="Two apps deployed in a Kubernetes cluster" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ideally, the frontend should send 80% of its requests to v1 and only 20% to v2.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But how?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can use a Service Mesh for that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73rgqdl550i8xr28qbea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73rgqdl550i8xr28qbea.png" alt="Splitting traffic with a Service mesh" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As soon as you install the service mesh, &lt;strong&gt;each pod in the cluster gains an extra container.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The container proxies all the outgoing and incoming requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6zvrgethsvogwmn9xa7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6zvrgethsvogwmn9xa7.png" alt="Envoy proxies intercepting all outgoing and incoming requests" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The proxy is automatically injected using a mutating webhook.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before the pod is stored in etcd, the YAML definition is modified, and the proxy is injected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhoia43um4ylvreaed0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhoia43um4ylvreaed0b.png" alt="A mutating webhook automatically injects the proxy in the pod" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
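&lt;p&gt;With Istio, for example, the webhook is switched on per namespace with a label; any pod created in a namespace like the following gets the sidecar injected:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Namespace
metadata:
  name: default
  labels:
    istio-injection: enabled   # the mutating webhook targets labelled namespaces
&lt;/code&gt;&lt;/pre&gt;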

&lt;p&gt;A service mesh is helpful because you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor metrics.&lt;/li&gt;
&lt;li&gt;Trace dependencies between components.&lt;/li&gt;
&lt;li&gt;Decide traffic splits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhed3bc19rz0lmld5955o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhed3bc19rz0lmld5955o.png" alt="Why you might want to use use a service mesh" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I generated some traffic to test it and used Kiali to trace it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It automatically mapped all the components and the direction of the traffic.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All without any hints from my side!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What about the canary release though?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnjssxfq3vl6kuc4hqlt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnjssxfq3vl6kuc4hqlt.png" alt="Mapping microservices with Kiali" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can use a service mesh to fine-tune how much traffic each app consumes.&lt;/p&gt;

&lt;p&gt;To test it, I created an 80-20 split between the two backends.&lt;/p&gt;

&lt;p&gt;In this example, I'm using an HTTPRoute:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4wpu8shggjzxtxxkp2v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4wpu8shggjzxtxxkp2v.png" alt="HTTPRoute Custom Resource Definition" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
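&lt;p&gt;A comparable 80-20 split can be sketched like this (the Gateway and service names are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: backend-split
spec:
  parentRefs:
    - name: my-gateway
  rules:
    - backendRefs:
        - name: backend-v1
          port: 8080
          weight: 80   # 80% of the requests
        - name: backend-v2
          port: 8080
          weight: 20   # 20% of the requests
&lt;/code&gt;&lt;/pre&gt;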

&lt;p&gt;&lt;a href="https://gateway-api.sigs.k8s.io/api-types/httproute/" rel="noopener noreferrer"&gt;HTTPRoute is an object part of the Gateway API&lt;/a&gt; that lets you gradually increase and decrease the traffic and which you can use to transition from an 80-20 split to 0-100.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72bnhxs50nd8w5onlqam.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72bnhxs50nd8w5onlqam.gif" alt="Demo of splitting traffic in Kubernetes" width="720" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Service meshes can also:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Help you roll out shadow releases.&lt;/li&gt;
&lt;li&gt;Encrypt pod-to-pod traffic.&lt;/li&gt;
&lt;li&gt;Mirror traffic between clusters.&lt;/li&gt;
&lt;li&gt;Inspect and rewrite traffic.&lt;/li&gt;
&lt;li&gt;Enforce policies.&lt;/li&gt;
&lt;li&gt;Inject faults to test the resilience.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And more!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Which one should you use?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.google.com/spreadsheets/d/1Bxf8VW9n-YyHeBiKdXt6zytOgw2cQlsDnK1gLUvsZ4A/edit#gid=907731238" rel="noopener noreferrer"&gt;At Learnk8s we've put together a spreadsheet to compare them.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;The Kubernetes workshops that we run at Learnk8s.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;This collection of past threads.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;The Kubernetes newsletter I publish every week.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Tracing pod to pod network traffic in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Tue, 09 May 2023 06:42:25 +0000</pubDate>
      <link>https://dev.to/danielepolencic/tracing-pod-to-pod-network-traffic-in-kubernetes-434k</link>
      <guid>https://dev.to/danielepolencic/tracing-pod-to-pod-network-traffic-in-kubernetes-434k</guid>
      <description>&lt;p&gt;&lt;em&gt;How does Pod to Pod communication work in Kubernetes?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How does the traffic reach the pod?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this article, you will dive into how low-level networking works in Kubernetes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's start by focusing on the pod and node networking.&lt;/p&gt;

&lt;p&gt;When you deploy a Pod, the following things happen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The pod gets its own network namespace.&lt;/li&gt;
&lt;li&gt;An IP address is assigned.&lt;/li&gt;
&lt;li&gt;Any containers in the pod share the same networking namespace and can see each other on localhost.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6hgc1bms7j4x6vwivd5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6hgc1bms7j4x6vwivd5.png" alt="Network namespaces in a Kubernetes node" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
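&lt;p&gt;You can observe point 3 with a simple two-container pod (image names are illustrative): both containers share one network namespace, so one can reach the other on localhost:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: shared-net
spec:
  containers:
    - name: web
      image: nginx            # listens on port 80
    - name: sidecar
      image: curlimages/curl  # can call http://localhost:80
      command: ["sleep", "infinity"]
&lt;/code&gt;&lt;/pre&gt;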

&lt;p&gt;&lt;strong&gt;A pod must first have access to the node's root namespace to reach other pods.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is achieved with a virtual ethernet (veth) pair connecting the two namespaces: pod and root.&lt;/p&gt;

&lt;p&gt;The bridge allows traffic to flow between virtual pairs and traverse through the common root namespace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fju9foh1z6vmmv73v2oqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fju9foh1z6vmmv73v2oqy.png" alt="Bridge that connects all containers in the node" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;So what happens when Pod-A wants to send a message to Pod-B?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Since the destination isn't one of the containers in the namespace, Pod-A sends out a packet to its default interface &lt;code&gt;eth0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This interface is tied to the veth pair and packets are forwarded to the root namespace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjavz4evd7mp2c6tscqsh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjavz4evd7mp2c6tscqsh.png" alt="Tracing the flow: starting from the container" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ethernet bridge, acting as a virtual switch, has to somehow resolve the destination pod IP (Pod-B) to its MAC address.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvyzcpr400hc122gki6cr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvyzcpr400hc122gki6cr.png" alt="The packet reaches the cni0 bridge" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ARP protocol comes to the rescue.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the frame reaches the bridge, an ARP broadcast is sent to all connected devices.&lt;/p&gt;

&lt;p&gt;The bridge shouts "Who has Pod-B IP address?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn25e9pnzu3m3f52mijw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn25e9pnzu3m3f52mijw.png" alt="ARP queries" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The interface that connects Pod-B replies with its MAC address, which the bridge stores in its ARP cache (a lookup table).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dbh742symtzi2j4xwdy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dbh742symtzi2j4xwdy.png" alt="ARP reply" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the IP and MAC address mapping is stored, the bridge looks up in the table and forwards the packet to the correct endpoint.&lt;/p&gt;

&lt;p&gt;The packet reaches Pod-B veth in the root namespace, and from there, it quickly reaches the eth0 interface inside the Pod-B namespace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrklvvqjsgo6fj2ypdt1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrklvvqjsgo6fj2ypdt1.png" alt="The packet reaches the other pod" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this, the communication between Pod-A and Pod-B has been successful.&lt;/p&gt;

&lt;p&gt;An additional hop is required for pods to communicate across different nodes, as the packets have to travel through the node network to reach their destination.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhgnhbo0uzv3d9a3xgiq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhgnhbo0uzv3d9a3xgiq.png" alt="Tracing pod traffic across nodes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the "plain" networking version.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How does this change when you install a CNI plugin that uses an overlay network?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let's take Flannel as an example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flannel installs a new interface between the node's &lt;code&gt;eth0&lt;/code&gt; and the container bridge &lt;code&gt;cni0&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All traffic flowing through this interface is encapsulated (e.g. VXLAN, Wireguard, etc.).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n4nf6g20k3ub1dt8c9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n4nf6g20k3ub1dt8c9e.png" alt="The Flannel interface encapsulates the traffic" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
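&lt;p&gt;The backend is chosen in Flannel's &lt;code&gt;net-conf.json&lt;/code&gt; (usually shipped in a ConfigMap); a typical VXLAN configuration looks roughly like this (the CIDR shown is Flannel's common default):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan"
  }
}
&lt;/code&gt;&lt;/pre&gt;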

&lt;p&gt;The new packets don't carry the pods' IP addresses as source and destination; they carry the nodes' IPs instead.&lt;/p&gt;

&lt;p&gt;The wrapped packet then exits the node and travels to the destination node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1mw55m264vy7qv6e4w6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1mw55m264vy7qv6e4w6.png" alt="Packets are encapsulated using different backends" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once on the other side, the &lt;code&gt;flannel.1&lt;/code&gt; interface unwraps the packet and lets the original pod-to-pod packet reach its destination.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8sfrldlje5ru2xfbdt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8sfrldlje5ru2xfbdt7.png" alt="The packet is unwrapped and forwarded to the interface" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How does Flannel know where all the Pods are located and their IP addresses?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;On each node, the Flannel daemon syncs the IP address allocations to a distributed database.&lt;/p&gt;

&lt;p&gt;Other instances can query this database to decide where to send those packets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3bettw4c3jw4qvk7czt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3bettw4c3jw4qvk7czt.png" alt="The Flannel daemons sync IP addresses to a (distributed) database" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are a few links if you want to learn more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/kubernetes-network-packets" rel="noopener noreferrer"&gt;https://learnk8s.io/kubernetes-network-packets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zhuanlan.zhihu.com/p/340747753" rel="noopener noreferrer"&gt;https://zhuanlan.zhihu.com/p/340747753&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.laputa.io/kubernetes-flannel-networking-6a1cb1f8ec7c" rel="noopener noreferrer"&gt;https://blog.laputa.io/kubernetes-flannel-networking-6a1cb1f8ec7c&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sobyte.net/post/2022-07/k8s-flannel/" rel="noopener noreferrer"&gt;https://www.sobyte.net/post/2022-07/k8s-flannel/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;https://learnk8s.io/training&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This collection of past threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Kubernetes newsletter I publish every week &lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;https://learnk8s.io/learn-kubernetes-weekly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>How etcd works in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Tue, 02 May 2023 13:37:21 +0000</pubDate>
      <link>https://dev.to/danielepolencic/how-etcd-works-in-kubernetes-373l</link>
      <guid>https://dev.to/danielepolencic/how-etcd-works-in-kubernetes-373l</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A newer, expanded version of this article is available at &lt;a href="https://learnkube.com/etcd-kubernetes" rel="noopener noreferrer"&gt;learnkube.com/etcd-kubernetes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you've ever interacted with a Kubernetes cluster in any way, chances are it was powered by etcd under the hood.&lt;/p&gt;

&lt;p&gt;But even though etcd is at the heart of how Kubernetes works, it's rare to interact with it directly on a day-to-day basis.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In this article, you will explore how it works!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecturally speaking, the Kubernetes API server is a CRUD application that stores manifests and serves data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hence, it needs a database to store its persisted data, which is where etcd fits into the picture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyir5f1nvvh5kwbqbq10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyir5f1nvvh5kwbqbq10.png" alt="Kubernetes control plane" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;According to its website, etcd is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Strongly consistent.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Key-value store.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In addition, etcd has another feature that Kubernetes extensively uses: &lt;strong&gt;change notifications.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Etcd allows clients to subscribe to changes to a particular key or set of keys.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg29zg5m6fvjou6kepe26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg29zg5m6fvjou6kepe26.png" alt="Key features of etcd" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Raft algorithm is the secret behind etcd's balance of strong consistency and high availability.&lt;/p&gt;

&lt;p&gt;Raft solves a particular problem: how can multiple processes decide on a single value for something?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raft works by electing a leader and forcing all write requests to go to it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forudux4tkltp4j15dnpj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forudux4tkltp4j15dnpj.png" alt="The Raft algorithm" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How does the Leader get elected, though?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, all nodes start in the Follower state.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik1givwrql32em40zst4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik1givwrql32em40zst4.png" alt="All nodes start in the follower state" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If followers don't hear from a leader, they can become candidates and request votes from other nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfz3zfh9777375et2y6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfz3zfh9777375et2y6d.png" alt="Followers can be become candidate" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nodes reply with their vote.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The candidate with the majority of the votes becomes the Leader.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Changes are then replicated from the Leader to all other nodes; if the Leader ever goes offline, a new election is held, and a new leader is chosen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxcg6u4sl0lla4gwphvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxcg6u4sl0lla4gwphvs.png" alt="A candidate becomes a leader" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What happens when you want to write a value in the database?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, all write requests are redirected to the Leader.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Leader makes a note of the request but doesn't commit it to the log yet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwueebp4r4pq724mdccm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwueebp4r4pq724mdccm.png" alt="All requests are forwarded to the Leader" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead, the Leader replicates the value to the rest of the nodes (the followers).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9fmgl8vcdkdr0cm7udn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9fmgl8vcdkdr0cm7udn.png" alt="The leader replicates the value to the followers" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finally, the Leader waits until a majority of nodes have written the entry and commits the value.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At that point, the state of the database contains the value.&lt;/p&gt;

&lt;p&gt;Once the write succeeds, an acknowledgement is sent back to the client.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko3bf9vsw3equk4zitmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko3bf9vsw3equk4zitmn.png" alt="The value is written to disk" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A new election is held if the cluster leader goes offline for any reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In practice, this means that etcd will remain available as long as most nodes are online.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How many nodes should an etcd cluster have to achieve "good enough" availability?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It depends.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfsvy0yzd7mkzlidi2rr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfsvy0yzd7mkzlidi2rr.png" alt="RAFT HA table" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To help you answer that question, let me ask another question!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why stop at three etcd nodes? Why not have a cluster with 9, 21 or more nodes?&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hint: check out the replication part.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnn23c54jlqjhfb3vk83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnn23c54jlqjhfb3vk83.png" alt="A cluster with 9 nodes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Leader has to wait for a quorum before the value is written to disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The more followers there are in the cluster, the longer it takes to reach a consensus.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In other words, you trade write speed for availability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fliaekb0cbeabbaa5qwjn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fliaekb0cbeabbaa5qwjn.png" alt="A cluster with 9 nodes takes more time to write values to disk" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you enjoyed this thread but want to know more about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change notifications.&lt;/li&gt;
&lt;li&gt;Creating etcd clusters.&lt;/li&gt;
&lt;li&gt;Replacing etcd with SQL-like databases using kine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://learnkube.com/etcd-kubernetes" rel="noopener noreferrer"&gt;Check out this article.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnkube.com/training" rel="noopener noreferrer"&gt;https://learnkube.com/training&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This collection of past threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Kubernetes newsletter I publish every week &lt;a href="https://learnkube.com/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;https://learnkube.com/learn-kubernetes-weekly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Labels and annotations in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Mon, 24 Apr 2023 13:20:42 +0000</pubDate>
      <link>https://dev.to/danielepolencic/labels-and-annotations-in-kubernetes-j4i</link>
      <guid>https://dev.to/danielepolencic/labels-and-annotations-in-kubernetes-j4i</guid>
      <description>&lt;p&gt;In Kubernetes, you can use labels to assign key-value pairs to any resources.&lt;/p&gt;

&lt;p&gt;Labels are ubiquitous and necessary for everyday operations such as creating services.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;However, how should you name and use those labels?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Any resource in Kubernetes can have labels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some labels are vital (e.g. service's selector, operators, etc.), and others are useful to tag resources (e.g. labelling a deployment).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl6hiq0n3j6maqgbkhmj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl6hiq0n3j6maqgbkhmj.png" alt="Examples of useful and useless labels in Kubernetes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kubectl offers a &lt;code&gt;--show-labels&lt;/code&gt; flag to help you list resources and their labels.&lt;/p&gt;

&lt;p&gt;If you list pods, deployments and services in an empty cluster, you might notice that Kubernetes uses the &lt;code&gt;component=&amp;lt;name&amp;gt;&lt;/code&gt; label to tag pods.&lt;/p&gt;
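
&lt;p&gt;The flag works with any resource type, and the &lt;code&gt;-L&lt;/code&gt; (short for &lt;code&gt;--label-columns&lt;/code&gt;) flag prints a single label as an extra column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List pods together with all of their labels
kubectl get pods --show-labels

# Print the value of a single label as an extra column
kubectl get pods -L component

# Works for deployments and services too
kubectl get deployments,services --show-labels
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;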

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdr89pndt0c94ixg8ov9l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdr89pndt0c94ixg8ov9l.png" alt="Retrieving labels from resources" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes recommends six labels for your resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Name&lt;/li&gt;
&lt;li&gt;Instance&lt;/li&gt;
&lt;li&gt;Version&lt;/li&gt;
&lt;li&gt;Component&lt;/li&gt;
&lt;li&gt;Part of&lt;/li&gt;
&lt;li&gt;Managed By&lt;/li&gt;
&lt;/ol&gt;
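
&lt;p&gt;In a manifest, those recommended labels live under the &lt;code&gt;app.kubernetes.io&lt;/code&gt; prefix; the values below are illustrative (they mirror the example in the official docs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;metadata:
  labels:
    app.kubernetes.io/name: mysql
    app.kubernetes.io/instance: mysql-abcxzy
    app.kubernetes.io/version: "5.7.21"
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: wordpress
    app.kubernetes.io/managed-by: helm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;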

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36u9kmkxynv4yo7gt6cs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36u9kmkxynv4yo7gt6cs.png" alt="Recommended labels for resources" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's look at an excellent example of using those labels: &lt;a href="https://github.com/prometheus-community/helm-charts" rel="noopener noreferrer"&gt;the Prometheus Helm chart.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chart installs five pods (i.e. server, alert manager, node exporter, push gateway and kube state metrics).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Notice how not all labels are applied to all pods.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9cull8tfcfeiv2cil7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9cull8tfcfeiv2cil7y.png" alt="Prometheus Helm chart labels" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Labelling resources properly helps you make sense of what's deployed.&lt;/p&gt;

&lt;p&gt;For example, you can filter results with kubectl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="s2"&gt;"environment in (staging, dev)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The command above only lists pods in staging and dev.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fas12xt72dt1ku1dc58co.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fas12xt72dt1ku1dc58co.png" alt="Selecting resources that are tagged with labels" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If those labels are not what you are after, you can always create your own.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;&amp;lt;prefix&amp;gt;/&amp;lt;name&amp;gt;&lt;/code&gt; key is recommended — e.g. &lt;code&gt;company.com/database&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44vlwpel2nuvh1w2khu0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44vlwpel2nuvh1w2khu0.png" alt="Custom label with prefixes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following labels could be used in a multitenant cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business unit&lt;/li&gt;
&lt;li&gt;Development team&lt;/li&gt;
&lt;li&gt;Application&lt;/li&gt;
&lt;li&gt;Client&lt;/li&gt;
&lt;li&gt;Shared services&lt;/li&gt;
&lt;li&gt;Environment&lt;/li&gt;
&lt;li&gt;Compliance&lt;/li&gt;
&lt;li&gt;Asset classification&lt;/li&gt;
&lt;/ul&gt;
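
&lt;p&gt;As a sketch, that might translate into labels like these — the prefix and values are hypothetical and should be adapted to your organisation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;metadata:
  labels:
    example.com/business-unit: commerce
    example.com/team: checkout
    example.com/environment: staging
    example.com/compliance: pci-dss
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;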

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkkqufnq77vwhq7muc1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkkqufnq77vwhq7muc1l.png" alt="Labels for multitenant cluster" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alongside labels, you have annotations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Whereas labels are used to select resources, annotations decorate resources with metadata.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You cannot select resources with annotations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbq1gxh2lnxdt219afbo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbq1gxh2lnxdt219afbo.png" alt="Annotations in Kubernetes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Administrators can assign annotations to any workload.&lt;/p&gt;

&lt;p&gt;However, more often, &lt;strong&gt;Kubernetes and operators decorate resources with extra annotations.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A good example is the annotation &lt;code&gt;kubernetes.io/ingress-bandwidth&lt;/code&gt; to assign bandwidth to pods.&lt;/p&gt;
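
&lt;p&gt;For example, the pod below is capped at 1 megabit per second in each direction (support depends on your CNI — the bandwidth plugin must be enabled):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: throttled-pod
  annotations:
    kubernetes.io/ingress-bandwidth: 1M
    kubernetes.io/egress-bandwidth: 1M
spec:
  containers:
    - name: app
      image: nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;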

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9ie2xd3rsnsl02qqult.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9ie2xd3rsnsl02qqult.png" alt="Annotating resources with ingress and egress bandwidth constraints" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/reference/labels-annotations-taints/" rel="noopener noreferrer"&gt;The official documentation has a list of well-known labels and annotations.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kubectl.kubernetes.io/default-container&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;topology.kubernetes.io/region&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;node.kubernetes.io/instance-type&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubernetes.io/egress-bandwidth&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Annotations are used extensively in operators.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Look at &lt;a href="https://github.com/kubernetes/ingress-nginx/blob/main/docs/user-guide/nginx-configuration/annotations.md" rel="noopener noreferrer"&gt;all the annotations you can use with the ingress-nginx controller.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F212zvzep6tvnvcj1q3bx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F212zvzep6tvnvcj1q3bx.png" alt="There are several annotations available for the ingress-nginx controller" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, using operator- or cloud-provider-specific annotations is not always a good idea if you wish to stay vendor-neutral.&lt;/p&gt;

&lt;p&gt;However, sometimes it's also the only option &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/network-load-balancing.html" rel="noopener noreferrer"&gt;(e.g. having an AWS load balancer deployed in the correct subnet when using a Service of type LoadBalancer).&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx23nvae36nn225y6g8wp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx23nvae36nn225y6g8wp.png" alt="Annotations available on AWS and EKS" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are a few links if you want to learn more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.kubecost.com/blog/kubernetes-labels/" rel="noopener noreferrer"&gt;The Guide to Kubernetes Labels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/reference/labels-annotations-taints/" rel="noopener noreferrer"&gt;Well-Known Labels, Annotations and Taints&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/" rel="noopener noreferrer"&gt;Recommended Labels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tigera.io/blog/label-standard-and-best-practices-for-kubernetes-security/" rel="noopener noreferrer"&gt;Label standard and best practices for Kubernetes security&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;The Kubernetes workshops that we run at Learnk8s.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;This collection of past threads.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;The Kubernetes newsletter I publish every week.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Autoscaling Ingress controllers in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Mon, 17 Apr 2023 12:29:34 +0000</pubDate>
      <link>https://dev.to/danielepolencic/autoscaling-ingress-controllers-in-kubernetes-1kgn</link>
      <guid>https://dev.to/danielepolencic/autoscaling-ingress-controllers-in-kubernetes-1kgn</guid>
      <description>&lt;p&gt;How do you deal with peaks of traffic in Kubernetes?&lt;/p&gt;

&lt;p&gt;To autoscale the Ingress controller based on incoming requests, you need the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt; (e.g. the requests per second).&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;metrics collector&lt;/strong&gt; (to store the metrics).&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;autoscaler&lt;/strong&gt; (to act on the data).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs39xaiwzmh5ve5ax848c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs39xaiwzmh5ve5ax848c.png" alt="What you need to scale the ingress controller" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let's start with metrics.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kubernetes.github.io/ingress-nginx/user-guide/monitoring/" rel="noopener noreferrer"&gt;The nginx-ingress can be configured to expose Prometheus metrics.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can use &lt;code&gt;nginx_connections_active&lt;/code&gt; to count the number of active requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fui5x00aluqy7vr8z2zek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fui5x00aluqy7vr8z2zek.png" alt="nginx_connections_active to count the number of active requests" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, you need a way to scrape the metrics.&lt;/p&gt;

&lt;p&gt;As you've already guessed, you can install Prometheus to do so.&lt;/p&gt;

&lt;p&gt;Since Nginx-ingress exposes its metrics via Prometheus scrape annotations (rather than the Operator's ServiceMonitors), I &lt;a href="https://github.com/prometheus-community/helm-charts" rel="noopener noreferrer"&gt;installed the plain Prometheus server without the Kubernetes operator.&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
&lt;span class="s2"&gt;"prometheus-community"&lt;/span&gt; has been added to your repositories
&lt;span class="nv"&gt;$ &lt;/span&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;prometheus prometheus-community/prometheus
NAME: prometheus
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I used Locust to generate some traffic to the Ingress to check that everything was running smoothly.&lt;/p&gt;

&lt;p&gt;With the Prometheus dashboard open, I checked that the metrics increased as more traffic hit the controller.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6my1dslikvwzslbfri29.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6my1dslikvwzslbfri29.gif" alt="Testing active connections" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The last piece of the puzzle was the autoscaler.&lt;/p&gt;

&lt;p&gt;I decided to go with &lt;a href="https://keda.sh" rel="noopener noreferrer"&gt;KEDA&lt;/a&gt; because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It's an autoscaler with a &lt;a href="https://github.com/kubernetes-sigs/metrics-server" rel="noopener noreferrer"&gt;metrics server&lt;/a&gt; (so I don't need to install two different tools).&lt;/li&gt;
&lt;li&gt;It's easier to configure than the Prometheus adapter.&lt;/li&gt;
&lt;li&gt;I can use the &lt;a href="https://keda.sh/docs/2.10/scalers/prometheus/" rel="noopener noreferrer"&gt;Horizontal Pod Autoscaler with PromQL.&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwva0bpqx0g8ygoc4qh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwva0bpqx0g8ygoc4qh6.png" alt="How KEDA works" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once I installed KEDA, I only had to &lt;a href="https://keda.sh/docs/2.10/concepts/scaling-deployments/#scaledobject-spec" rel="noopener noreferrer"&gt;create a ScaledObject&lt;/a&gt;, configure the source of the metrics (Prometheus), and scale the Pods (with a PromQL query).&lt;/p&gt;
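
&lt;p&gt;A minimal sketch of such a ScaledObject — the deployment name, Prometheus address and threshold below are assumptions you'd adapt to your cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ingress-scaler
spec:
  scaleTargetRef:
    name: ingress-nginx-controller   # deployment to scale (assumed name)
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.default.svc.cluster.local:80
        query: sum(nginx_connections_active)
        threshold: "100"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;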

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw1ryofbk64613mcgv6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw1ryofbk64613mcgv6h.png" alt="Example of a ScaledObject for Prometheus scaler in KEDA" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KEDA automatically creates the HPA for me.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I repeated the tests with Locust and watched the replicas increase as more traffic hit the Nginx Ingress controller!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohe03mnyuq4xybyegr6w.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohe03mnyuq4xybyegr6w.gif" alt="Scaling the Ingress Controllers with KEDA" width="720" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Can this pattern be extended to any other app?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Can you autoscale all microservices on the number of requests received?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Unless they expose the metrics, the answer is no.&lt;/p&gt;

&lt;p&gt;However, there's a workaround.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kedacore/http-add-on" rel="noopener noreferrer"&gt;KEDA ships with an HTTP add-on to enable HTTP scaling.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How does it work!?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KEDA injects a sidecar proxy into your pod so that all HTTP traffic is routed through it first.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Then it measures the number of requests and exposes the metrics.&lt;/p&gt;

&lt;p&gt;With that data at hand, you can finally trigger the autoscaler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0b4uunk9as95dc5wwgcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0b4uunk9as95dc5wwgcb.png" alt="Keda HTTP add-on architecture" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;KEDA is not the only option, though.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You could install the &lt;a href="https://github.com/kubernetes-sigs/prometheus-adapter" rel="noopener noreferrer"&gt;Prometheus Adapter.&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The metrics will flow from Nginx to Prometheus, and then the Adapter will make them available to Kubernetes.&lt;/p&gt;

&lt;p&gt;From there, they are consumed by the Horizontal Pod Autoscaler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flibi8p5k3o255wind6vp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flibi8p5k3o255wind6vp.png" alt="Prometheus adapter architecture" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Is this better than KEDA?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;They are similar, as both have to query and buffer metrics from Prometheus.&lt;/p&gt;

&lt;p&gt;However, KEDA is pluggable, and the Adapter works exclusively with Prometheus.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7jpqldim11durl9t7dh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7jpqldim11durl9t7dh.png" alt="Similarity between KEDA &amp;amp; the Prometheus Adapter" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Is there a competitor to KEDA?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A promising project called the &lt;a href="https://custom-pod-autoscaler.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Custom Pod Autoscaler aims to make the pod autoscaler pluggable.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, the project focuses more on how those pods should be scaled (i.e. the scaling algorithm) than on metrics collection.&lt;/p&gt;

&lt;p&gt;During my research, I found these links helpful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://keda.sh/docs/2.10/scalers/prometheus/" rel="noopener noreferrer"&gt;https://keda.sh/docs/2.10/scalers/prometheus/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-hpa-prometheus/" rel="noopener noreferrer"&gt;https://sysdig.com/blog/kubernetes-hpa-prometheus/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/nginxinc/nginx-prometheus-exporter#exported-metrics" rel="noopener noreferrer"&gt;https://github.com/nginxinc/nginx-prometheus-exporter#exported-metrics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/scaling-celery-rabbitmq-kubernetes" rel="noopener noreferrer"&gt;https://learnk8s.io/scaling-celery-rabbitmq-kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;The Kubernetes workshops that we run at Learnk8s.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;This collection of past threads.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;The Kubernetes newsletter I publish every week.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Multi-tenancy in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Mon, 10 Apr 2023 12:28:23 +0000</pubDate>
      <link>https://dev.to/danielepolencic/multi-tenancy-in-kubernetes-49d1</link>
      <guid>https://dev.to/danielepolencic/multi-tenancy-in-kubernetes-49d1</guid>
      <description>&lt;p&gt;&lt;em&gt;Should you have more than one team using the same Kubernetes cluster?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Can you run untrusted workloads safely from untrusted users?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Does Kubernetes do multi-tenancy?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This article will explore the challenges of running a cluster with multiple tenants.&lt;/p&gt;

&lt;p&gt;Multi-tenancy can be divided into:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Soft multi-tenancy&lt;/strong&gt; for when you trust your tenants — like when you share a cluster with teams from the same company.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard multi-tenancy&lt;/strong&gt; for when you don't trust tenants.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can also have a mix!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fooaqaghmumw1t3dume.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fooaqaghmumw1t3dume.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The basic building block to share a cluster between tenants is the namespace.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Namespaces group resources logically — they don't offer any security mechanisms nor guarantee that all resources are deployed in the same node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7rqt1q0kkmx7y6hmih5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7rqt1q0kkmx7y6hmih5.png" alt="Namespaces are used to logically group resources" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pods in a namespace can still talk to all other pods in the cluster, make requests to the API, and use as many resources as they want.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Out of the box, any user can access any namespace.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How should you stop that?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dbez80io54m1vca5mnf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dbez80io54m1vca5mnf.png" alt="Namespaces don't offer any mechanism for isolation" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learnk8s.io/rbac-kubernetes" rel="noopener noreferrer"&gt;With RBAC, you can limit what users and apps can do with and within a namespace.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A common operation is to grant users limited permissions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rd77a1kqkrmnmkhmlqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rd77a1kqkrmnmkhmlqj.png" alt="Limiting what users can do in a namespace" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
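&lt;p&gt;As a sketch, granting a user read-only access to pods in a tenant's namespace could look like this (the namespace, role, and user names are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: tenant-a
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: tenant-a
subjects:
  - kind: User
    name: jane
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
&lt;/code&gt;&lt;/pre&gt;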

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/concepts/policy/resource-quotas/" rel="noopener noreferrer"&gt;With Quotas&lt;/a&gt; and &lt;a href="https://kubernetes.io/docs/concepts/policy/limit-range/" rel="noopener noreferrer"&gt;LimitRanges&lt;/a&gt;, you can limit the resources deployed in the namespace and the memory, CPU, etc., that can be utilized.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is an excellent idea if you want to limit what a tenant can do with their namespace.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyei4in9tfyfv3hj39cu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyei4in9tfyfv3hj39cu2.png" alt="LimitRange and ResourceQuota" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
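&lt;p&gt;For example, a ResourceQuota capping what a tenant can consume in their namespace might look like this (the namespace name and the numbers are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
&lt;/code&gt;&lt;/pre&gt;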

&lt;p&gt;&lt;strong&gt;By default, all pods can talk to any pod in Kubernetes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is not great for multi-tenancy, but you can correct this with &lt;a href="https://github.com/ahmetb/kubernetes-network-policy-recipes" rel="noopener noreferrer"&gt;NetworkPolicies.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Network policies are similar to firewall rules that let you segregate outbound and inbound traffic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2iy8cdu3b9t5y6q5kvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2iy8cdu3b9t5y6q5kvn.png" alt="Network policies in Kubernetes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
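&lt;p&gt;As a sketch, the following policy denies ingress traffic from other namespaces while still letting pods within the same namespace talk to each other (the namespace name is hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-from-other-namespaces
  namespace: tenant-a
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}
&lt;/code&gt;&lt;/pre&gt;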

&lt;p&gt;&lt;em&gt;Great, is the namespace secure now?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not so fast.&lt;/p&gt;

&lt;p&gt;While RBAC, NetworkPolicies, Quotas, etc., give you the basic building blocks for multi-tenancy, they are not enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes has several shared components.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A good example is the Ingress controller, which is usually deployed once per cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If two tenants submit Ingress manifests with the same path, the last one overwrites the definition, and only one works.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwngp105jbkvboy5x4flk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwngp105jbkvboy5x4flk.png" alt="A single Ingress controller shared across namespaces" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's a better idea to deploy a controller per namespace.&lt;/p&gt;

&lt;p&gt;Another interesting challenge is CoreDNS.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What if one of the tenants abuses the DNS?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The rest of the cluster will suffer too.&lt;/p&gt;

&lt;p&gt;You could limit requests with an extra plugin &lt;a href="https://github.com/coredns/policy" rel="noopener noreferrer"&gt;https://github.com/coredns/policy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92o8xy1ag4jwbg1rz779.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92o8xy1ag4jwbg1rz779.png" alt="Abusing of CoreDNS" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same challenge applies to the Kubernetes API server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes isn't aware of the tenant, and if the API receives too many requests, it will throttle them for everyone.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I don't know if there's a workaround for this!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3pmznmm8ccqc3206a3l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3pmznmm8ccqc3206a3l.png" alt="Abusing of the Kubernetes API" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Assuming you manage to sort out shared resources, there are also challenges with the kubelet and workloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://xxradar.medium.com/exploiting-applications-using-livenessprobes-in-kubernetes-cdff6329d320" rel="noopener noreferrer"&gt;As Philippe Bogaerts explains in this article,&lt;/a&gt; a tenant could take over nodes in the cluster just (ab)using liveness probes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The fix is not trivial.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpct405xfa1tdwrlswbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpct405xfa1tdwrlswbe.png" alt="Exploiting Kubernetes with liveness probes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You could have a linter as part of your CI/CD process or use admission controllers to verify that resources submitted to the cluster are safe.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/open-policy-agent/gatekeeper-library" rel="noopener noreferrer"&gt;Here is a library or rules for the Open Policy Agent.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmu4k2ari8g4n1wq30s6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmu4k2ari8g4n1wq30s6d.png" alt="Using admission controllers to lint and validate resources" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You also have containers that offer a weaker isolation mechanism than virtual machines.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=JaMJJTb_bEE" rel="noopener noreferrer"&gt;Lewis Denham-Parry shows how to escape from a container in this video.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How can you fix this?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You could use a container sandbox like &lt;a href="https://gvisor.dev/" rel="noopener noreferrer"&gt;gVisor&lt;/a&gt;, light virtual machines as containers (&lt;a href="https://katacontainers.io/" rel="noopener noreferrer"&gt;Kata containers&lt;/a&gt;, &lt;a href="https://github.com/firecracker-microvm/firecracker-containerd" rel="noopener noreferrer"&gt;firecracker + containerd&lt;/a&gt;) or full virtual machines (&lt;a href="https://github.com/Mirantis/virtlet" rel="noopener noreferrer"&gt;virtlet as a CRI&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggp48mngfp9l6fj3t711.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggp48mngfp9l6fj3t711.png" alt="gVisor, Kata containers, Firecracker + containerd, virtlet" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hopefully, you've realized the complexity of the subject and how it's hard to provide rigid boundaries to separate networks, workloads, and controllers in Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.jessfraz.com/post/hard-multi-tenancy-in-kubernetes/" rel="noopener noreferrer"&gt;That's why providing hard multi-tenancy in Kubernetes is not recommended.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you need hard multi-tenancy, the advice is to use multiple clusters or a Cluster-as-a-Service tool instead.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes-sigs/cluster-api" rel="noopener noreferrer"&gt;Cluster API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/openshift/hypershift" rel="noopener noreferrer"&gt;HyperShift&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/clastix/kamaji" rel="noopener noreferrer"&gt;Kamaji&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gardener.cloud/" rel="noopener noreferrer"&gt;Gardener&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can tolerate the weaker multi-tenancy model in exchange for simplicity and convenience, you can roll out your own RBAC, Quotas, etc.&lt;/p&gt;

&lt;p&gt;But there are a few tools that abstract those problems away for you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes-sigs/cluster-api-provider-nested/tree/main/virtualcluster" rel="noopener noreferrer"&gt;Virtual Cluster (wg-multitenancy)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.vcluster.com/" rel="noopener noreferrer"&gt;Vcluster&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/clastix/capsule" rel="noopener noreferrer"&gt;Capsule&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/loft-sh/kiosk" rel="noopener noreferrer"&gt;Kiosk&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;https://learnk8s.io/training&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This collection of past threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Kubernetes newsletter I publish every week &lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;https://learnk8s.io/learn-kubernetes-weekly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Pod rebalancing and allocations in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Mon, 03 Apr 2023 12:43:38 +0000</pubDate>
      <link>https://dev.to/danielepolencic/pod-rebalancing-and-allocations-in-kubernetes-4b9n</link>
      <guid>https://dev.to/danielepolencic/pod-rebalancing-and-allocations-in-kubernetes-4b9n</guid>
      <description>&lt;p&gt;&lt;strong&gt;Does Kubernetes rebalance your Pods?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If there's a node that has more space, does Kubernetes recompute and balance the workloads?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let's have a look at an example.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You have a cluster with a single node that can host 2 Pods.&lt;/p&gt;

&lt;p&gt;If the node crashes, you will experience downtime.&lt;/p&gt;

&lt;p&gt;To prevent this, you could add a second node and run one Pod on each.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyickd24r3k1o3f3vraz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyickd24r3k1o3f3vraz.png" alt="A Kubernetes cluster with a single node" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You provision a second node.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What happens next?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Does Kubernetes notice that there's a space for your Pod?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Does it move the second Pod and rebalance the cluster?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5enh062bcnox8y2ca49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5enh062bcnox8y2ca49.png" alt="Does Kubernetes move the pods to the lower utilized node?" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unfortunately, it does not.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But why?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When you define a Deployment, you specify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The template for the Pod.&lt;/li&gt;
&lt;li&gt;The number of copies (replicas).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwkxn4k20a20ppwnci3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwkxn4k20a20ppwnci3e.png" alt="A Kubernetes deployment" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But nowhere in that file did you say that you want one replica on each node!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ReplicaSet counts 2 Pods, and that matches the desired state.&lt;/p&gt;

&lt;p&gt;Kubernetes won't take any further action.&lt;/p&gt;

&lt;p&gt;In other words, Kubernetes &lt;strong&gt;does not&lt;/strong&gt; rebalance your pods automatically.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But you can fix this.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are three popular options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pod (anti-)affinity.&lt;/li&gt;
&lt;li&gt;Pod topology spread constraints.&lt;/li&gt;
&lt;li&gt;The Descheduler.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first option is to &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/" rel="noopener noreferrer"&gt;use pod anti-affinity.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With pod anti-affinity, your Pods repel other pods with the same label, forcing them to be on different nodes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/" rel="noopener noreferrer"&gt;You can read more about pod anti-affinity here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7p9l3w0lj6408h85frv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7p9l3w0lj6408h85frv.png" alt="Example of a pod anti-affinity" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;
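&lt;p&gt;As a sketch, adding the following to a Deployment's pod spec forces pods carrying the same label onto different nodes (the &lt;code&gt;app: myapp&lt;/code&gt; label is hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: myapp
        topologyKey: kubernetes.io/hostname
&lt;/code&gt;&lt;/pre&gt;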

&lt;p&gt;&lt;strong&gt;Notice how pod anti-affinity is evaluated only when the scheduler allocates the pods.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is not applied retroactively, so you might need to delete a few pods to force the scheduler to recompute the allocations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0ulg6fdwlkqpg437r9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0ulg6fdwlkqpg437r9f.png" alt="Kubernetes does not rebalance two pods that have pod anti-affinity and are already allocated to the same node" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alternatively, you can use &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/" rel="noopener noreferrer"&gt;topology spread constraints&lt;/a&gt; to control how Pods are spread across your cluster among failure domains such as regions, zones, nodes, etc.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is similar to pod affinity but more powerful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ewebdgqalgikgreyg2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ewebdgqalgikgreyg2g.png" alt="Spreading pods across failure domains" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With topology spread constraints, you can pick the topology and choose the pod distribution (skew), what happens when the constraint is unfulfillable (schedule anyway vs don't) and the interaction with pod affinity and taints.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpakievs3priy1nn2o1c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpakievs3priy1nn2o1c.png" alt="Example of pod topology spread constraints" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
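&lt;p&gt;A minimal sketch of a constraint that spreads pods evenly across nodes (the &lt;code&gt;app: myapp&lt;/code&gt; label is hypothetical); it goes in the pod spec:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: myapp
&lt;/code&gt;&lt;/pre&gt;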

&lt;p&gt;However, even in this case, the scheduler evaluates topology spread constraints when the pod is allocated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They are not applied retroactively, so you may need to delete the pods to force the scheduler to reallocate them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4aa6za60dufb82grw8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4aa6za60dufb82grw8g.png" alt="Kubernetes does not rebalance two pods that have pod topology spread constraints and are already allocated to the same node" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you want to rebalance your pods dynamically (not just when the scheduler allocates them), you should check out &lt;a href="https://github.com/kubernetes-sigs/descheduler" rel="noopener noreferrer"&gt;the Descheduler.&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Descheduler scans your cluster at regular intervals, and if it finds a node that is more utilized than others, it deletes a pod in that node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo09pllo1j63lxue7nnxh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo09pllo1j63lxue7nnxh.png" alt="The Descheduler deletes pods" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What happens when a Pod is deleted?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ReplicaSet will create a new Pod, and the scheduler will likely place it in a less utilized node.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your pod has topology spread constraints or pod affinity, it will be allocated accordingly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdlag22ylfnxgh8bihoh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdlag22ylfnxgh8bihoh.png" alt="The Kubernetes scheduler will allocate pods efficiently" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Descheduler can evict pods based on policies such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node utilization.&lt;/li&gt;
&lt;li&gt;Pod age.&lt;/li&gt;
&lt;li&gt;Failed pods.&lt;/li&gt;
&lt;li&gt;Duplicates.&lt;/li&gt;
&lt;li&gt;Affinity or taints violations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3k49a894pjy5am0c7jq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3k49a894pjy5am0c7jq.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If your cluster has been running for a long time, the resource utilization could be unbalanced.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The following two strategies can be used to rebalance your cluster based on CPU, memory or number of pods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forifo4kbu8d7jhc3d0l9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forifo4kbu8d7jhc3d0l9.png" alt="Descheduler configuration to rebalance under- and overutilized nodes" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;
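&lt;p&gt;As a sketch, a &lt;code&gt;LowNodeUtilization&lt;/code&gt; policy might look like this (the percentage thresholds are hypothetical; nodes below &lt;code&gt;thresholds&lt;/code&gt; are considered underutilized, and nodes above &lt;code&gt;targetThresholds&lt;/code&gt; are candidates for eviction):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:
          cpu: 50
          memory: 50
          pods: 50
&lt;/code&gt;&lt;/pre&gt;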

&lt;p&gt;Another practical policy is preventing developers and operators from treating pods like virtual machines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can use the descheduler to ensure pods only run for a fixed time (e.g. seven days).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni6dsp1xaqzax739jzad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni6dsp1xaqzax739jzad.png" alt="Deleting pods after 7 days of utilization" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
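&lt;p&gt;A sketch of a &lt;code&gt;PodLifeTime&lt;/code&gt; policy that evicts pods older than seven days:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 604800 # 7 days
&lt;/code&gt;&lt;/pre&gt;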

&lt;p&gt;&lt;strong&gt;Lastly, you can combine the Descheduler with Node Problem Detector and Cluster Autoscaler to automatically remove Nodes with problems.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kubernetes-sigs/descheduler/blob/master/docs/user-guide.md#autoheal-node-problems" rel="noopener noreferrer"&gt;The Descheduler can be used to descheduler workloads from those Nodes.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Descheduler is an excellent choice to keep your cluster efficiency in check, but it isn't installed by default.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kubernetes-sigs/descheduler" rel="noopener noreferrer"&gt;It can be deployed as a Job, CronJob or Deployment.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;https://learnk8s.io/training&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This collection of past threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Kubernetes newsletter I publish every week &lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;https://learnk8s.io/learn-kubernetes-weekly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Memory requests and limits in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Mon, 27 Mar 2023 12:23:51 +0000</pubDate>
      <link>https://dev.to/danielepolencic/memory-requests-and-limits-in-kubernetes-40ep</link>
      <guid>https://dev.to/danielepolencic/memory-requests-and-limits-in-kubernetes-40ep</guid>
      <description>&lt;p&gt;In Kubernetes, what should I use as memory requests and limits?&lt;/p&gt;

&lt;p&gt;And what happens when you don't set them?&lt;/p&gt;

&lt;p&gt;Let's dive into it.&lt;/p&gt;

&lt;p&gt;In Kubernetes, you have two ways to specify how much memory a pod can use:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"Requests" are usually used to determine the average consumption.&lt;/li&gt;
&lt;li&gt;"Limits" set the max number of resources allowed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Kubernetes scheduler uses requests to determine where the pod should be allocated in the cluster.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since the scheduler doesn't know the consumption (the pod hasn't started yet), it needs a hint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzls66ti1zi1wr0y3bls3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzls66ti1zi1wr0y3bls3.png" alt="The Kubernetes scheduler works best with requests" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The kubelet uses limits to stop the process when it uses more memory than is allowed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's worth noting that the process could spike in memory usage before it's terminated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmm8nnzk2lh73fob9j5r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmm8nnzk2lh73fob9j5r.png" alt="The kubelet terminates the container when it goes over the memory limit" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The kubelet is also in charge of monitoring the total memory utilization of the node.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If memory is running low, the kubelet evicts low-priority pods.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But how does it decide what's low priority?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gq8i10qtcctjgiv5hd0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gq8i10qtcctjgiv5hd0.png" alt="The kubelet evict pods if the node is running low on resources" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When Kubernetes creates a Pod, it assigns one of these QoS classes to the Pod:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Guaranteed&lt;/li&gt;
&lt;li&gt;Burstable&lt;/li&gt;
&lt;li&gt;BestEffort&lt;/li&gt;
&lt;/ol&gt;
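
&lt;p&gt;You can inspect which class Kubernetes assigned to a running pod (the pod name here is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pod my-pod -o jsonpath='{.status.qosClass}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;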

&lt;p&gt;&lt;strong&gt;Pods that are "Guaranteed" have CPU and memory requests and limits and are least likely to face eviction.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For this class, the memory request must equal the memory limit, and the CPU request must equal the CPU limit.&lt;/p&gt;

&lt;p&gt;This class is best suited for stateful applications like databases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xtyh1poz299hfvp682n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xtyh1poz299hfvp682n.png" alt="Guaranteed Quality of Service for Pods" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pods with a "Burstable" class have memory and CPU requests but not limits.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This allows the Pods to flexibly increase their resource usage when spare capacity is available (but, without limits, they can also consume any amount of resources on the node).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknu85ds6wetr0ekv679x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknu85ds6wetr0ekv679x.png" alt="burstable Quality of Service for Pods" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Pod is "BestEffort" only if none of its containers has a memory or CPU limit or request.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Those Pods are the first to be evicted in the event of Node resource pressure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6jbn14rp7nk46avqpke.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6jbn14rp7nk46avqpke.png" alt="Burstable Quality of Service for Pods" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most of your pods are likely to be "Burstable" (i.e. requests, but fewer limits), and only a select few should be "Guaranteed".&lt;/p&gt;

&lt;p&gt;Burstable pods are good because they use resources dynamically and are cheaper.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2utuxz2ttz1hcupd3vo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2utuxz2ttz1hcupd3vo.png" alt="With Burstable pods you can dymically allocate resources as the container needs them" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Guaranteed pods, you allocate all resources up to the limit upfront, which could result in more expensive (but safer) deployments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8m2s7ks50t56f7yg16z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8m2s7ks50t56f7yg16z.png" alt="With Guaranteed pods, resources are allocated upfront and can't be freed even if the process isn't using them" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;BestEffort pods are generally something you should avoid.&lt;/p&gt;

&lt;p&gt;The Kubernetes scheduler doesn't know how much memory or CPU the process needs, so it could end up scheduling an impractical amount of pods in the existing nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2vgkdcp1uuiaelcq0wo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2vgkdcp1uuiaelcq0wo.png" alt="You can fit as many BestEffort pods in a node as you wish" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But if you stick only to Burstable pods, how does the kubelet know which pod to evict first?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pods can have PriorityClass that indicates the importance of a Pod relative to other Pods.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic7bydsdvpj9or3smn19.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic7bydsdvpj9or3smn19.png" alt="Pod PriorityClass" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scheduler also leverages the Pod PriorityClass to preempt lower-priority pods when the cluster is full.&lt;/p&gt;

&lt;p&gt;For example, if you have batch jobs that can tolerate interruptions (e.g. reports), you can assign them a low priority so they are evicted first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdvr8o1i3kj4jpiz1tqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdvr8o1i3kj4jpiz1tqn.png" alt="Pods with lower priority are evicted to make space for higher priority pods" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How should you choose the memory request for a pod?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A simple way is to calculate the smallest memory unit as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REQ = NODE_MEM / MAX_PODS_PER_NODE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a 4GB node and a limit of 10 Pods, that's a 400MB request.&lt;/p&gt;

&lt;p&gt;Assign the smallest unit or a multiplier to your containers.&lt;/p&gt;
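
&lt;p&gt;Continuing the example above (with illustrative values), a container assigned a single unit would request roughly 400Mi:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  requests:
    memory: "400Mi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;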

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpofkcll17k3bd0ko9fm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpofkcll17k3bd0ko9fm.png" alt="Assigning requests for your pods" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A better approach is to monitor the app and derive the memory utilization.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can do this with your existing monitoring infrastructure or use the Vertical Pod Autoscaler to monitor and report the average request value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqastyo6hdxpzlstxgzn.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqastyo6hdxpzlstxgzn.gif" alt="Measuring memory consumption with the Vertical Pod Autoscaler" width="760" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How should I set the limits?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Exceeding the memory limit gets the container terminated, so you should definitely set a value lower than the memory available on the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learnk8s.io/kubernetes-instance-calculator" rel="noopener noreferrer"&gt;Here's a handy calculator for that.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, if you want to dig in more, here are a few relevant links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/" rel="noopener noreferrer"&gt;Pod priorities&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/" rel="noopener noreferrer"&gt;Node pressure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod" rel="noopener noreferrer"&gt;Pod Quality of Service&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;https://learnk8s.io/training&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This collection of past threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Kubernetes newsletter I publish every week &lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;https://learnk8s.io/learn-kubernetes-weekly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>IP and pod allocations in EKS</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Tue, 21 Mar 2023 01:15:26 +0000</pubDate>
      <link>https://dev.to/danielepolencic/ip-and-pod-allocations-in-eks-5me</link>
      <guid>https://dev.to/danielepolencic/ip-and-pod-allocations-in-eks-5me</guid>
      <description>&lt;p&gt;When running an EKS cluster, you might face two issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running out of IP addresses assigned to pods.&lt;/li&gt;
&lt;li&gt;Low pod count per node (due to ENI limits).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, you will learn how to overcome both issues.&lt;/p&gt;

&lt;p&gt;Before we start, here is some background on how intra-node networking works in Kubernetes.&lt;/p&gt;

&lt;p&gt;When a node is created, the kubelet delegates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creating the container to the Container Runtime.&lt;/li&gt;
&lt;li&gt;Attaching the container to the network to the CNI.&lt;/li&gt;
&lt;li&gt;Mounting volumes to the CSI.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdenm4d0ykohtpblhgwnh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdenm4d0ykohtpblhgwnh.png" alt="The kubelet delegates tasks to the CRI, CNI and CSI" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let's focus on the CNI part.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each pod has its own isolated Linux network namespace and is attached to a bridge.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The CNI is responsible for creating the bridge, assigning the IP address and connecting veth0 to the cni0 bridge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2shsdg5nfdqvcqr8bs6j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2shsdg5nfdqvcqr8bs6j.png" alt="In most cases, all containers on a node are connected to a network bridge" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the most common setup, but different CNIs might use other means to connect the container to the network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;As an example, there might not be a cni0 bridge.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AWS-CNI is an example of such a CNI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34ibgvxzp3b5pgsdi794.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34ibgvxzp3b5pgsdi794.png" alt="Not all CNI use a bridge to connect the containers on the same node" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In AWS, each EC2 instance can have multiple network interfaces (ENIs).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can assign a limited number of IPs to each ENI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For example, an &lt;code&gt;m5.large&lt;/code&gt; can have up to 10 IPs per ENI.&lt;/p&gt;

&lt;p&gt;Of those 10 IPs, one is reserved for the network interface itself.&lt;/p&gt;

&lt;p&gt;The rest can be assigned to pods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueu12l3nael0k9xsq319.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueu12l3nael0k9xsq319.png" alt="Elastic Network interfaces and IP addresses" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Previously, you could use the extra IPs and assign them to Pods.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But there was a big limit: the number of IP addresses.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let's have a look at an example.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With an &lt;code&gt;m5.large&lt;/code&gt;, you have up to 3 ENIs with 10 private IP addresses each.&lt;/p&gt;

&lt;p&gt;Since one IP is reserved, you're left with 9 per ENI (or 27 in total).&lt;/p&gt;

&lt;p&gt;That means that your &lt;code&gt;m5.large&lt;/code&gt; could run up to 27 Pods.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Not a lot.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0l2kagl443gjppj7iy4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0l2kagl443gjppj7iy4.png" alt="You can have up to 27 pods in a m5.large" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But AWS released a change to EC2 that allows "prefixes" to be assigned to network interfaces.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Prefixes what?!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In simple words, ENIs now support a range instead of a single IP address.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If before you could have 10 private IP addresses, now you can have 10 slots of IP addresses.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;And how big is the slot?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By default, 16 IP addresses (a /28 prefix).&lt;/p&gt;

&lt;p&gt;With 10 slots, you could have up to 160 IP addresses.&lt;/p&gt;

&lt;p&gt;That's a rather significant change!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let's have a look at an example.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhzpztenvq8fz76iwefq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhzpztenvq8fz76iwefq.png" alt="Addresses prefix in EC2: before and after" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With an &lt;code&gt;m5.large&lt;/code&gt;, you have 3 ENIs with 10 slots (or IPs) each.&lt;/p&gt;

&lt;p&gt;Since one IP is reserved for the ENI, you're left with 9 slots.&lt;/p&gt;

&lt;p&gt;Each slot is 16 IPs, so &lt;code&gt;9*16=144&lt;/code&gt; IPs.&lt;/p&gt;

&lt;p&gt;Since there are 3 ENIs, &lt;code&gt;144x3=432&lt;/code&gt; IPs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can have up to 432 Pods now (vs 27 before).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld9jnkmrn7veayziwszs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld9jnkmrn7veayziwszs.png" alt="You can have up to 432 pods in a m5.large" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AWS-CNI supports slots but caps the max number of Pods at 110 or 250 (depending on the instance size), so you won't be able to run 432 Pods on an &lt;code&gt;m5.large&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It's also worth pointing out that this is not enabled by default — not even in newer clusters.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Perhaps because only nitro instances support it.&lt;/em&gt;&lt;/p&gt;
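
&lt;p&gt;If you want to opt in, prefix assignment is controlled by an environment variable on the aws-node DaemonSet:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;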

&lt;p&gt;Assigning slots is great until you realize that the CNI hands out 16 IP addresses at once instead of only 1, which has the following implications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quicker IP space exhaustion.&lt;/li&gt;
&lt;li&gt;Fragmentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Let's review those.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafyu6no7xtsz6vgkb040.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafyu6no7xtsz6vgkb040.png" alt="Issue with prefixes in EC2 and EKS" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A pod is scheduled to a node.&lt;/p&gt;

&lt;p&gt;The AWS-CNI allocates 1 slot (16 IPs), and the pod uses one of them.&lt;/p&gt;

&lt;p&gt;Now imagine having 5 nodes and a deployment with 5 replicas.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What happens?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph76062iqikczay06wcu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph76062iqikczay06wcu.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Kubernetes scheduler prefers to spread the pods across the cluster.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Likely, each node receives 1 pod, and the AWS-CNI allocates 1 slot (16 IPs).&lt;/p&gt;

&lt;p&gt;You allocated &lt;code&gt;5*16=80&lt;/code&gt; IPs from your network, but only 5 are used.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9zjz4x9hwh22x8mjrcx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9zjz4x9hwh22x8mjrcx.png" alt="IP allocations with the AWS CNI" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But there's more.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slots allocate a contiguous block of IP addresses.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a new slot is needed (e.g. when a node is created), the subnet must still have a contiguous block of 16 free addresses; in a fragmented subnet, such a block might not exist even if enough individual IPs are free.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vfzo36aqswvbelmqn2f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vfzo36aqswvbelmqn2f.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How can you solve those?&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/eks-multiple-cidr-ranges/" rel="noopener noreferrer"&gt;You can assign a secondary CIDR to EKS.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/subnet-cidr-reservation.html" rel="noopener noreferrer"&gt;You can reserve IP space within a subnet for exclusive use by slots.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Relevant links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#AvailableIpPerENI" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#AvailableIpPerENI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-prefix-eni.html#ec2-prefix-basics" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-prefix-eni.html#ec2-prefix-basics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;https://learnk8s.io/training&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This collection of past threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Kubernetes newsletter I publish every week &lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;https://learnk8s.io/learn-kubernetes-weekly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
