<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Daniele Polencic</title>
    <description>The latest articles on DEV Community by Daniele Polencic (@danielepolencic).</description>
    <link>https://dev.to/danielepolencic</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F84473%2F1a8a7aa5-c250-402e-9b75-45c61bab002f.png</url>
      <title>DEV Community: Daniele Polencic</title>
      <link>https://dev.to/danielepolencic</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/danielepolencic"/>
    <language>en</language>
    <item>
      <title>Sticky sessions and canary releases in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Mon, 19 Jun 2023 12:17:23 +0000</pubDate>
      <link>https://dev.to/danielepolencic/sticky-sessions-and-canary-releases-in-kubernetes-5a92</link>
      <guid>https://dev.to/danielepolencic/sticky-sessions-and-canary-releases-in-kubernetes-5a92</guid>
      <description>&lt;p&gt;Sticky sessions, or session affinity, is a convenient strategy to ensure that subsequent requests from the same client always reach the same pod.&lt;/p&gt;

&lt;p&gt;Let's look at how it works by deploying a sample application with three replicas and one service.&lt;/p&gt;

&lt;p&gt;In this scenario, &lt;strong&gt;requests directed to the service are load-balanced amongst the available replicas.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfua7cbutm2ohh761b4h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyfua7cbutm2ohh761b4h.png" alt="A deployment and a service in Kubernetes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's deploy the &lt;a href="https://github.com/kubernetes/ingress-nginx" rel="noopener noreferrer"&gt;ingress-nginx controller&lt;/a&gt; and create an Ingress manifest for the deployment.&lt;/p&gt;

&lt;p&gt;In this case, &lt;strong&gt;the ingress controller skips the service&lt;/strong&gt; and load balances the traffic directly to the pods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbt05k7sy0d62z82cyrt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbt05k7sy0d62z82cyrt.png" alt="A deployment, a service and ingress controller in Kubernetes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While the two scenarios end up with the same outcome (i.e. requests are distributed to all replicas), there's a subtle (but essential) distinction: &lt;strong&gt;the Service operates on L4 (TCP/UDP), whereas the Ingress is L7 (HTTP).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fierp2sq8px249to2g9wk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fierp2sq8px249to2g9wk.png" alt="Difference between Kubernetes service and ingress controller" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unlike the service, the Ingress controller can route traffic based on paths, headers, etc.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can also use it to define weights (e.g. 20-80 traffic split) or sticky sessions (all requests from the same origin always land on the same pod).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rk64ou6k3z78rx0pbbw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rk64ou6k3z78rx0pbbw.png" alt="Sticky sessions, weighted traffic and canary release in an Ingress controller" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following Ingress implements &lt;strong&gt;sticky sessions for the ingress-nginx controller.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ingress writes a cookie to your browser to keep track of which pod served you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdfxbohspc5ny18pfhch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdfxbohspc5ny18pfhch.png" alt="Session affinity in ingress-nginx" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are two convenient settings for affinity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;balanced&lt;/strong&gt; — requests are redistributed if the deployment scales up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;persistent&lt;/strong&gt; — no matter what, the requests always stick to the same pod.&lt;/li&gt;
&lt;/ol&gt;
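&lt;p&gt;The mode is controlled with an annotation; a minimal sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;metadata:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    # "balanced" is the default; "persistent" always sticks to the same pod
    nginx.ingress.kubernetes.io/affinity-mode: "persistent"
&lt;/code&gt;&lt;/pre&gt;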

&lt;p&gt;&lt;strong&gt;ingress-nginx can also be used for canary releases.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you have two deployments and you wish to route a subset of the traffic to the newer version, you can do so with a canary release (and impact a minimal number of users).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnzkr7ah9fol73d61k8n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnzkr7ah9fol73d61k8n.png" alt="A canary release with ingress-nginx" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In a canary release, each deployment has its own Ingress manifest.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;However, one of those is labelled as a canary.&lt;/p&gt;

&lt;p&gt;You can decide how the traffic is forwarded: for example, you could inspect a header or cookie.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F224e08ozpaaygjx4ld47.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F224e08ozpaaygjx4ld47.png" alt="Canary release beased on a header value" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this example, all requests carrying the header value east-us are routed to the canary deployment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpohg5iho8qzoo0w10ej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpohg5iho8qzoo0w10ej.png" alt="Example of routing traffic with a canary release" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also decide which fraction of the total traffic is routed to the canary with weights.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgh691pgjtp55w3bmpghy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgh691pgjtp55w3bmpghy.png" alt="Setting weights to a canary release" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But if the header is omitted in a subsequent request, the user falls back to the previous deployment.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How can you fix that?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fviwq5lnfacrqw6bxxxt9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fviwq5lnfacrqw6bxxxt9.png" alt="Traffic in canary releases is not sticky" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;With sticky sessions!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can combine canary releases and sticky sessions with ingress-nginx to progressively (and safely) roll out new deployments to your users.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyk26b5wn414oymtkmwmr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyk26b5wn414oymtkmwmr.png" alt="Combining canary releases and sticky sessions" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's important to remember that these types of canary releases are only possible for apps whose traffic flows through the ingress controller.&lt;/p&gt;

&lt;p&gt;To roll out a canary release for internal microservices, you should look at alternatives (e.g. service mesh).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppyeywyqv73gzwihvkao.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppyeywyqv73gzwihvkao.png" alt="Traffic to internal services is not forwarded by the ingress controller" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Is ingress-nginx the only option for sticky sessions and canary releases?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not really, but the annotations differ between ingress controllers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.google.com/spreadsheets/d/191WWNpjJ2za6-nbG4ZoUMXMpUK8KlCIosvQB0f-oq3k" rel="noopener noreferrer"&gt;At Learnk8s, we've put together a spreadsheet to compare them.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;The Kubernetes workshops that we run at Learnk8s.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;This collection of past threads.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;The Kubernetes newsletter I publish every week.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While authoring this post, I also found the following resources valuable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.mirantis.com/mke/3.6/ops/deploy-apps-k8s/nginx-ingress/configure-canary-deployment.html" rel="noopener noreferrer"&gt;Configure a canary deployment&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://help.ovhcloud.com/csm/en-sg-public-cloud-kubernetes-sticky-session-nginx-ingress?id=kb_article_view&amp;amp;sysparm_article=KB0049968" rel="noopener noreferrer"&gt;Sticky sessions/Session Affinity based on Nginx Ingress&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pauldally.medium.com/session-affinity-and-kubernetes-proceed-with-caution-8e66fd5deb05" rel="noopener noreferrer"&gt;Session Affinity and Kubernetes— Proceed With Caution!&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes/ingress-nginx/blob/main/docs/user-guide/nginx-configuration/annotations.md#session-affinity" rel="noopener noreferrer"&gt;Session Affinity&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>What happens when you create a Pod in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Tue, 30 May 2023 17:54:17 +0000</pubDate>
      <link>https://dev.to/danielepolencic/what-happens-when-you-create-a-pod-in-kubernetes-58io</link>
      <guid>https://dev.to/danielepolencic/what-happens-when-you-create-a-pod-in-kubernetes-58io</guid>
      <description>&lt;p&gt;What happens when you create a Pod in Kubernetes?&lt;/p&gt;

&lt;p&gt;A surprisingly simple task reveals a complicated workflow that touches several components in the cluster.&lt;/p&gt;

&lt;p&gt;Let's start with the obvious: &lt;strong&gt;kubectl sends the YAML definition to the API server.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this step, kubectl:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discovers the API endpoints using OpenAPI (Swagger).&lt;/li&gt;
&lt;li&gt;Negotiates the preferred API version.&lt;/li&gt;
&lt;li&gt;Validates the YAML.&lt;/li&gt;
&lt;li&gt;Issues the request.&lt;/li&gt;
&lt;/ul&gt;
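&lt;p&gt;For reference, the YAML that kubectl validates and submits could be as simple as this minimal Pod (names and image are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: app
      image: nginx:1.25
      ports:
        - containerPort: 80
&lt;/code&gt;&lt;/pre&gt;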

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeina9t7otyis95rj12u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgeina9t7otyis95rj12u.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the request reaches the API, it goes through the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Authentication &amp;amp; authorization.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Admission controllers.&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the last step, it's finally stored in etcd.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vkpvl8hoafx9zokph1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0vkpvl8hoafx9zokph1r.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this, the pod is added to the scheduler queue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The scheduler filters and scores the nodes to find the best one.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;And it finally binds the pod to the node.&lt;/p&gt;

&lt;p&gt;The binding is written in etcd.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdrzrp8on28glku8wymj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbdrzrp8on28glku8wymj.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, the pod exists only in etcd as a record.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The infrastructure hasn't created any containers yet.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's where the kubelet takes over.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv77ly7mgmlqlz8aj8kcf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv77ly7mgmlqlz8aj8kcf.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The kubelet pulls the Pod definition and proceeds to delegate:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Network creation to the CNI (e.g. Cilium).&lt;/li&gt;
&lt;li&gt;Container creation to the CRI (e.g. containerd).&lt;/li&gt;
&lt;li&gt;Storage creation to the CSI (e.g. OpenEBS).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljn6kiwk9m2zk9dxuysi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fljn6kiwk9m2zk9dxuysi.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Among other things, the Kubelet will execute the Pod's probes and, when the Pod is running, report its IP address to the control plane.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That IP and the containers' ports are stored as endpoints in etcd.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpxk8vqp6obv50qrj7yy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnpxk8vqp6obv50qrj7yy.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Wait… endpoint what?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In Kubernetes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an endpoint is an IP:port pair (e.g. 10.0.0.2:3000).&lt;/li&gt;
&lt;li&gt;an Endpoints object is a collection of endpoints (a list of IP:port pairs).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For every Service in the cluster, &lt;strong&gt;Kubernetes creates an Endpoints object with a list of endpoints.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Confusing, isn't it?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7itlzs7p4161mf6q0j2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7itlzs7p4161mf6q0j2.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The endpoints (IP:port) are used by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;kube-proxy to set iptables rules.&lt;/li&gt;
&lt;li&gt;CoreDNS to update the DNS entries.&lt;/li&gt;
&lt;li&gt;Ingress controllers to configure their backends (upstreams).&lt;/li&gt;
&lt;li&gt;Service meshes.&lt;/li&gt;
&lt;li&gt;And more operators.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As soon as an endpoint is added, the components are notified.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk294zd0hg8rq0hf653d0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk294zd0hg8rq0hf653d0.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When the endpoint (IP:port) is propagated, you can finally start using the Pod!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What happens when you delete a Pod?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The same process, but in reverse.&lt;/p&gt;

&lt;p&gt;This is trickier than it sounds because there are several opportunities for race conditions.&lt;/p&gt;

&lt;p&gt;The correct sequence is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The app stops accepting new connections.&lt;/li&gt;
&lt;li&gt;The controllers (kube-proxy, ingress, etc.) remove the endpoint.&lt;/li&gt;
&lt;li&gt;The app drains the existing connections.&lt;/li&gt;
&lt;li&gt;The app shuts down.&lt;/li&gt;
&lt;/ol&gt;
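&lt;p&gt;A common way to give the controllers time to remove the endpoint is a preStop hook that delays the shutdown (a sketch; the image name and durations are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;spec:
  terminationGracePeriodSeconds: 45
  containers:
    - name: app
      image: my-app:1.0.0
      lifecycle:
        preStop:
          exec:
            # wait for the endpoint removal to propagate before SIGTERM is sent
            command: ["sleep", "15"]
&lt;/code&gt;&lt;/pre&gt;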

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuouemh9mzvxfekbihtwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuouemh9mzvxfekbihtwq.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to learn more about the graceful shutdown in Kubernetes, you can find my article here &lt;a href="https://learnk8s.io/graceful-shutdown" rel="noopener noreferrer"&gt;https://learnk8s.io/graceful-shutdown&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like the Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;https://learnk8s.io/training&lt;/a&gt; or this collection of past Twitter threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes scheduler deep dive</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Tue, 23 May 2023 12:44:34 +0000</pubDate>
      <link>https://dev.to/danielepolencic/kubernetes-scheduler-deep-dive-3phj</link>
      <guid>https://dev.to/danielepolencic/kubernetes-scheduler-deep-dive-3phj</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A newer, expanded version of this article is available at &lt;a href="https://learnkube.com/kubernetes-scheduler-explained" rel="noopener noreferrer"&gt;learnkube.com/kubernetes-scheduler-explained&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The scheduler is in charge of deciding where your pods are deployed in the cluster.&lt;/p&gt;

&lt;p&gt;It might sound like an easy job, but it's rather complicated!&lt;/p&gt;

&lt;p&gt;Let's start with the basics.&lt;/p&gt;

&lt;p&gt;When you submit a deployment with kubectl, the API server receives the request, and the resource is stored in etcd.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Who creates the pods?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0f611e8cbs4vtnuzaoi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe0f611e8cbs4vtnuzaoi.png" alt="A pod resource is stored in Etcd" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's a common misconception that it's the scheduler's job to create the pods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instead, the controller manager creates them (and the associated ReplicaSet).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftessmrsxfalie5gc92nc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftessmrsxfalie5gc92nc.png" alt="The controller manager creates the pods" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, the pods are stored as "Pending" in etcd and are not assigned to any node.&lt;/p&gt;

&lt;p&gt;They are also added to the scheduler's queue, ready to be assigned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ycupq8x3z66cdchn5i4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ycupq8x3z66cdchn5i4.png" alt="Pods are added to the scheduler queue." width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scheduler processes Pods one by one through two phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scheduling phase (what node should I choose?).&lt;/li&gt;
&lt;li&gt;Binding phase (let's write to the database that this pod belongs to that node).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1fuvn52rj03vhfbnygw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi1fuvn52rj03vhfbnygw.png" alt="Pods are allocated one at the time" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scheduling phase is divided into two parts. The scheduler:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Filters relevant nodes (using a list of functions called predicates)&lt;/li&gt;
&lt;li&gt;Ranks the remaining nodes (using a list of functions called priorities)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's have a look at an example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2vnfepah7ke36y93dmp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw2vnfepah7ke36y93dmp.png" alt="The scheduler filters and scores nodes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consider the following cluster, with a mix of nodes with and without a GPU.&lt;/p&gt;

&lt;p&gt;Also, a few nodes are already running at full capacity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm5nf99i84mkimlctq2v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmm5nf99i84mkimlctq2v.png" alt="A collection of Kubernetes nodes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You want to deploy a Pod that requires a GPU.&lt;/p&gt;

&lt;p&gt;You submit the pod to the cluster, and it's added to the scheduler queue.&lt;/p&gt;

&lt;p&gt;The scheduler discards all nodes that don't have GPU (filter phase).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqk3guozu8hp42y1k7y0e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqk3guozu8hp42y1k7y0e.png" alt="All non-GPU nodes are discarded" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, the scheduler scores the remaining nodes.&lt;/p&gt;

&lt;p&gt;In this example, the fully utilized nodes are scored lower.&lt;/p&gt;

&lt;p&gt;In the end, the empty node is selected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsibu12nobxigaysubftv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsibu12nobxigaysubftv.png" alt="The remaining nodes are scored" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What are some examples of filters?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;NodeUnschedulable&lt;/code&gt; prevents pods from landing on nodes marked as unschedulable.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;VolumeBinding&lt;/code&gt; checks if the node can bind the requested volume.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The default filtering phase has 13 predicates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbovjk4b5pnrr4ytqth5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbovjk4b5pnrr4ytqth5.png" alt="Default predicates in the Kubernetes scheduler" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some examples of scoring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;ImageLocality&lt;/code&gt; prefers nodes that already have the container image downloaded locally.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;NodeResourcesBalancedAllocation&lt;/code&gt; prefers underutilized nodes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are 13 functions to decide how to score and rank nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9ltqojgnwihrinj90ow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe9ltqojgnwihrinj90ow.png" alt="Default functions to score nodes in Kubernetes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How can you influence the scheduler's decisions?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;nodeSelector&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Node affinity&lt;/li&gt;
&lt;li&gt;Pod affinity/anti-affinity&lt;/li&gt;
&lt;li&gt;Taints and tolerations&lt;/li&gt;
&lt;li&gt;Topology constraints&lt;/li&gt;
&lt;li&gt;Scheduler profiles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;nodeSelector&lt;/code&gt; is the most straightforward mechanism.&lt;/p&gt;

&lt;p&gt;You assign a label to a node and add that label to the pod.&lt;/p&gt;

&lt;p&gt;The pod can only be deployed on nodes with that label.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foc1cl8cr740ycryt2rpa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foc1cl8cr740ycryt2rpa.png" alt="Assigning pods to nodes with the nodeSelector" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
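&lt;p&gt;As a minimal sketch (the &lt;code&gt;disktype=ssd&lt;/code&gt; label is illustrative), you label a node with &lt;code&gt;kubectl label nodes node-1 disktype=ssd&lt;/code&gt; and then reference that label in the pod:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  nodeSelector:
    disktype: ssd          # only nodes carrying this label are eligible
  containers:
    - name: app
      image: nginx
&lt;/code&gt;&lt;/pre&gt;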

&lt;p&gt;Node affinity extends nodeSelector with a more flexible interface.&lt;/p&gt;

&lt;p&gt;You can still tell the scheduler where the Pod should be deployed, but you can also have soft and hard constraints.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquerpdi9bclj960dq9qr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fquerpdi9bclj960dq9qr.png" alt="Assigning pods to nodes with node affinity" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
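&lt;p&gt;For example, you can combine a hard constraint (&lt;code&gt;requiredDuringSchedulingIgnoredDuringExecution&lt;/code&gt;) with a soft one (&lt;code&gt;preferredDuringSchedulingIgnoredDuringExecution&lt;/code&gt;); the labels below are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  affinity:
    nodeAffinity:
      # Hard constraint: the pod can only land on nodes matching this.
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: disktype
                operator: In
                values: ["ssd"]
      # Soft constraint: the scheduler prefers, but doesn't require, a match.
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values: ["eu-west-1a"]
  containers:
    - name: app
      image: nginx
&lt;/code&gt;&lt;/pre&gt;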

&lt;p&gt;With Pod affinity/anti-affinity, you can ask the scheduler to place a pod next to a specific pod.&lt;/p&gt;

&lt;p&gt;Or not.&lt;/p&gt;

&lt;p&gt;For example, you could have a deployment with anti-affinity on itself to force spreading pods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmat547ed1drxons9v40z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmat547ed1drxons9v40z.png" alt="Scheduling pods with pod affinity and anti-affinity" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;
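&lt;p&gt;A sketch of such a deployment, with anti-affinity on its own &lt;code&gt;app&lt;/code&gt; label so that no two replicas share a node:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app
  template:
    metadata:
      labels:
        app: app
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: app
              topologyKey: kubernetes.io/hostname   # at most one replica per node
      containers:
        - name: app
          image: nginx
&lt;/code&gt;&lt;/pre&gt;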

&lt;p&gt;With taints and tolerations, nodes are tainted, and pods that don't tolerate the taint are repelled.&lt;/p&gt;

&lt;p&gt;This is similar to node affinity, but there's a notable difference: with Node affinity, Pods are attracted to nodes.&lt;/p&gt;

&lt;p&gt;Taints are the opposite - they allow a node to repel pods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuw0msjhdmj9fxomj0c5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmuw0msjhdmj9fxomj0c5.png" alt="Scheduling pods with taints and tolerations" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
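&lt;p&gt;As a sketch (the &lt;code&gt;dedicated=gpu&lt;/code&gt; taint and image name are made up), you taint a node with &lt;code&gt;kubectl taint nodes node-1 dedicated=gpu:NoSchedule&lt;/code&gt;, and only pods carrying a matching toleration can land on it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: gpu-job
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu
      effect: NoSchedule   # this pod tolerates the taint above
  containers:
    - name: job
      image: my-gpu-image   # illustrative image name
&lt;/code&gt;&lt;/pre&gt;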

&lt;p&gt;Moreover, taints can repel pods with three effects: &lt;code&gt;NoExecute&lt;/code&gt; (evict), &lt;code&gt;NoSchedule&lt;/code&gt;, and &lt;code&gt;PreferNoSchedule&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Personal note: this is one of the most difficult APIs I worked with.&lt;/p&gt;

&lt;p&gt;I always (and consistently) get it wrong as it's hard (for me) to reason in double negatives.&lt;/p&gt;

&lt;p&gt;You can use topology spread constraints to control how Pods are spread across your cluster.&lt;/p&gt;

&lt;p&gt;This is convenient when you want to ensure that all pods aren't landing on the same node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dhvq28uygtkajy2pskc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dhvq28uygtkajy2pskc.png" alt="Pod topology constraints" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
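&lt;p&gt;A minimal sketch of a spread constraint that keeps replicas evenly distributed across nodes (labels are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: app
  labels:
    app: app
spec:
  topologySpreadConstraints:
    - maxSkew: 1                         # at most 1 pod of difference between nodes
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule   # hard; use ScheduleAnyway for a soft constraint
      labelSelector:
        matchLabels:
          app: app
  containers:
    - name: app
      image: nginx
&lt;/code&gt;&lt;/pre&gt;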

&lt;p&gt;And finally, you can use scheduler profiles to customize how the scheduler combines filters and scoring plugins when assigning nodes to pods.&lt;/p&gt;

&lt;p&gt;Profiles replaced the legacy scheduler policies and allow you to turn off built-in plugins or add new logic to the scheduler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4k0yt5iatymigsk1r9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4k0yt5iatymigsk1r9w.png" alt="Scheduler policies in Kubernetes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
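&lt;p&gt;For example, a profile that disables one of the default scoring plugins might look like this (the profile name is made up); pods opt in by setting a matching &lt;code&gt;schedulerName&lt;/code&gt; in their spec:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: no-image-locality
    plugins:
      score:
        disabled:
          - name: ImageLocality   # skip the image-locality score for these pods
&lt;/code&gt;&lt;/pre&gt;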

&lt;p&gt;You can learn more about the scheduler here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes scheduler &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Scheduling framework &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Scheduler configuration &lt;a href="https://kubernetes.io/docs/reference/scheduling/config/" rel="noopener noreferrer"&gt;https://kubernetes.io/docs/reference/scheduling/config/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnkube.com/training" rel="noopener noreferrer"&gt;https://learnkube.com/training&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This collection of past threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Kubernetes newsletter I publish every week "Learn Kubernetes weekly" &lt;a href="https://learnkube.com/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;https://learnkube.com/learn-kubernetes-weekly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Traffic shaping with Istio and Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Mon, 15 May 2023 12:51:41 +0000</pubDate>
      <link>https://dev.to/danielepolencic/traffic-shaping-with-istio-and-kubernetes-4pcf</link>
      <guid>https://dev.to/danielepolencic/traffic-shaping-with-istio-and-kubernetes-4pcf</guid>
      <description>&lt;p&gt;You can roll out an app only to a subset of your users in Kubernetes using canary releases with Istio, Kiali and the Gateway API.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let's start by looking at an example.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The current cluster has three apps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A backend that exposes an API at version v1.&lt;/li&gt;
&lt;li&gt;Another app on version v2.&lt;/li&gt;
&lt;li&gt;A frontend component that consumes the API.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ugybbov6jbaf8cgckeq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ugybbov6jbaf8cgckeq.png" alt="Two apps deployed in a Kubernetes cluster" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ideally, the frontend should send 80% of its requests to v1 and only 20% to v2.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But how?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can use a Service Mesh for that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73rgqdl550i8xr28qbea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73rgqdl550i8xr28qbea.png" alt="Splitting traffic with a Service mesh" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As soon as you install the service mesh, &lt;strong&gt;each pod in the cluster gains an extra container.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The container proxies all the outgoing and incoming requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6zvrgethsvogwmn9xa7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6zvrgethsvogwmn9xa7.png" alt="Envoy proxies intercepting all outgoing and incoming requests" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The proxy is automatically injected using a mutating webhook.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before the pod is stored in etcd, the YAML definition is modified, and the proxy is injected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhoia43um4ylvreaed0b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhoia43um4ylvreaed0b.png" alt="A mutating webhook automatically injects the proxy in the pod" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
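&lt;p&gt;With Istio, for example, the webhook is switched on per namespace with a label; any pod created in a namespace like the following gets the sidecar injected:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Namespace
metadata:
  name: default
  labels:
    istio-injection: enabled   # the mutating webhook targets labelled namespaces
&lt;/code&gt;&lt;/pre&gt;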

&lt;p&gt;A service mesh is helpful because you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor metrics.&lt;/li&gt;
&lt;li&gt;Trace dependencies between components.&lt;/li&gt;
&lt;li&gt;Decide traffic splits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhed3bc19rz0lmld5955o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhed3bc19rz0lmld5955o.png" alt="Why you might want to use use a service mesh" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I generated some traffic to test it and used Kiali to trace it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It automatically mapped all the components and the direction of the traffic.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All without any hints from my side!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What about the canary release though?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnjssxfq3vl6kuc4hqlt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnjssxfq3vl6kuc4hqlt.png" alt="Mapping microservices with Kiali" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can use a service mesh to fine-tune how much traffic each app consumes.&lt;/p&gt;

&lt;p&gt;To test it, I created an 80-20 split between the two backends.&lt;/p&gt;

&lt;p&gt;In this example, I'm using an HTTPRoute:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4wpu8shggjzxtxxkp2v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4wpu8shggjzxtxxkp2v.png" alt="HTTPRoute Custom Resource Definition" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
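&lt;p&gt;A comparable 80-20 split can be sketched like this (the Gateway and service names are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: backend-split
spec:
  parentRefs:
    - name: my-gateway
  rules:
    - backendRefs:
        - name: backend-v1
          port: 8080
          weight: 80   # 80% of the requests
        - name: backend-v2
          port: 8080
          weight: 20   # 20% of the requests
&lt;/code&gt;&lt;/pre&gt;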

&lt;p&gt;&lt;a href="https://gateway-api.sigs.k8s.io/api-types/httproute/" rel="noopener noreferrer"&gt;HTTPRoute is an object part of the Gateway API&lt;/a&gt; that lets you gradually increase and decrease the traffic and which you can use to transition from an 80-20 split to 0-100.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72bnhxs50nd8w5onlqam.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72bnhxs50nd8w5onlqam.gif" alt="Demo of splitting traffic in Kubernetes" width="720" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Service meshes can also:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Help you roll out shadow releases.&lt;/li&gt;
&lt;li&gt;Encrypt pod-to-pod traffic.&lt;/li&gt;
&lt;li&gt;Mirror traffic between clusters.&lt;/li&gt;
&lt;li&gt;Inspect and rewrite traffic.&lt;/li&gt;
&lt;li&gt;Enforce policies.&lt;/li&gt;
&lt;li&gt;Inject faults to test the resilience.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And more!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Which one should you use?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.google.com/spreadsheets/d/1Bxf8VW9n-YyHeBiKdXt6zytOgw2cQlsDnK1gLUvsZ4A/edit#gid=907731238" rel="noopener noreferrer"&gt;At Learnk8s we've put together a spreadsheet to compare them.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;The Kubernetes workshops that we run at Learnk8s.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;This collection of past threads.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;The Kubernetes newsletter I publish every week.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Tracing pod to pod network traffic in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Tue, 09 May 2023 06:42:25 +0000</pubDate>
      <link>https://dev.to/danielepolencic/tracing-pod-to-pod-network-traffic-in-kubernetes-434k</link>
      <guid>https://dev.to/danielepolencic/tracing-pod-to-pod-network-traffic-in-kubernetes-434k</guid>
      <description>&lt;p&gt;&lt;em&gt;How does Pod to Pod communication work in Kubernetes?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How does the traffic reach the pod?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this article, you will dive into how low-level networking works in Kubernetes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let's start by focusing on the pod and node networking.&lt;/p&gt;

&lt;p&gt;When you deploy a Pod, the following things happen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The pod gets its own network namespace.&lt;/li&gt;
&lt;li&gt;An IP address is assigned.&lt;/li&gt;
&lt;li&gt;Any containers in the pod share the same networking namespace and can see each other on localhost.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6hgc1bms7j4x6vwivd5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6hgc1bms7j4x6vwivd5.png" alt="Network namespaces in a Kubernetes node" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
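&lt;p&gt;You can observe point 3 with a simple two-container pod (image names are illustrative): both containers share one network namespace, so one can reach the other on localhost:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: shared-net
spec:
  containers:
    - name: web
      image: nginx            # listens on port 80
    - name: sidecar
      image: curlimages/curl  # can call http://localhost:80
      command: ["sleep", "infinity"]
&lt;/code&gt;&lt;/pre&gt;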

&lt;p&gt;&lt;strong&gt;A pod must first have access to the node's root namespace to reach other pods.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is achieved with a virtual ethernet (veth) pair connecting the two namespaces: pod and root.&lt;/p&gt;

&lt;p&gt;The bridge allows traffic to flow between virtual pairs and traverse through the common root namespace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fju9foh1z6vmmv73v2oqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fju9foh1z6vmmv73v2oqy.png" alt="Bridge that connects all containers in the node" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;So what happens when Pod-A wants to send a message to Pod-B?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Since the destination isn't one of the containers in the namespace, Pod-A sends out a packet to its default interface &lt;code&gt;eth0&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This interface is tied to the veth pair and packets are forwarded to the root namespace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjavz4evd7mp2c6tscqsh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjavz4evd7mp2c6tscqsh.png" alt="Tracing the flow: starting from the container" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ethernet bridge, acting as a virtual switch, has to somehow resolve the destination pod IP (Pod-B) to its MAC address.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvyzcpr400hc122gki6cr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvyzcpr400hc122gki6cr.png" alt="The packet reaches the cni0 bridge" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ARP protocol comes to the rescue.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When the frame reaches the bridge, an ARP broadcast is sent to all connected devices.&lt;/p&gt;

&lt;p&gt;The bridge shouts "Who has Pod-B IP address?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn25e9pnzu3m3f52mijw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frn25e9pnzu3m3f52mijw.png" alt="ARP queries" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The interface that connects Pod-B replies with its MAC address, which the bridge stores in its ARP cache (a lookup table).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dbh742symtzi2j4xwdy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dbh742symtzi2j4xwdy.png" alt="ARP reply" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once the IP and MAC address mapping is stored, the bridge looks up in the table and forwards the packet to the correct endpoint.&lt;/p&gt;

&lt;p&gt;The packet reaches Pod-B veth in the root namespace, and from there, it quickly reaches the eth0 interface inside the Pod-B namespace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrklvvqjsgo6fj2ypdt1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrklvvqjsgo6fj2ypdt1.png" alt="The packet reaches the other pod" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this, the communication between Pod-A and Pod-B has been successful.&lt;/p&gt;

&lt;p&gt;An additional hop is required for pods to communicate across different nodes, as the packets have to travel through the node network to reach their destination.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhgnhbo0uzv3d9a3xgiq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbhgnhbo0uzv3d9a3xgiq.png" alt="Tracing pod traffic across nodes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the "plain" networking version.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How does this change when you install a CNI plugin that uses an overlay network?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Let's take Flannel as an example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flannel installs a new interface between the node's &lt;code&gt;eth0&lt;/code&gt; and the container bridge &lt;code&gt;cni0&lt;/code&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All traffic flowing through this interface is encapsulated (e.g. VXLAN, Wireguard, etc.).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n4nf6g20k3ub1dt8c9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n4nf6g20k3ub1dt8c9e.png" alt="The Flannel interface encapsulates the traffic" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
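&lt;p&gt;The backend is chosen in Flannel's &lt;code&gt;net-conf.json&lt;/code&gt; (usually shipped in a ConfigMap); a typical VXLAN configuration looks roughly like this (the CIDR shown is Flannel's common default):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan"
  }
}
&lt;/code&gt;&lt;/pre&gt;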

&lt;p&gt;The new packets don't carry the pods' IP addresses as source and destination; they carry the nodes' IPs instead.&lt;/p&gt;

&lt;p&gt;The wrapped packet then exits the node and travels to the destination node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1mw55m264vy7qv6e4w6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1mw55m264vy7qv6e4w6.png" alt="Packets are encapsulated using different backends" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once on the other side, the &lt;code&gt;flannel.1&lt;/code&gt; interface unwraps the packet and lets the original pod-to-pod packet reach its destination.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8sfrldlje5ru2xfbdt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv8sfrldlje5ru2xfbdt7.png" alt="The packet is unwrapped and forwarded to the interface" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How does Flannel know where all the Pods are located and their IP addresses?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;On each node, the Flannel daemon syncs the IP address allocations to a distributed database.&lt;/p&gt;

&lt;p&gt;Other instances can query this database to decide where to send those packets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3bettw4c3jw4qvk7czt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe3bettw4c3jw4qvk7czt.png" alt="The Flannel daemons sync IP addresses to a (distributed) database" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are a few links if you want to learn more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/kubernetes-network-packets" rel="noopener noreferrer"&gt;https://learnk8s.io/kubernetes-network-packets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zhuanlan.zhihu.com/p/340747753" rel="noopener noreferrer"&gt;https://zhuanlan.zhihu.com/p/340747753&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.laputa.io/kubernetes-flannel-networking-6a1cb1f8ec7c" rel="noopener noreferrer"&gt;https://blog.laputa.io/kubernetes-flannel-networking-6a1cb1f8ec7c&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.sobyte.net/post/2022-07/k8s-flannel/" rel="noopener noreferrer"&gt;https://www.sobyte.net/post/2022-07/k8s-flannel/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;https://learnk8s.io/training&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This collection of past threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Kubernetes newsletter I publish every week &lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;https://learnk8s.io/learn-kubernetes-weekly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>How etcd works in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Tue, 02 May 2023 13:37:21 +0000</pubDate>
      <link>https://dev.to/danielepolencic/how-etcd-works-in-kubernetes-373l</link>
      <guid>https://dev.to/danielepolencic/how-etcd-works-in-kubernetes-373l</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A newer, expanded version of this article is available at &lt;a href="https://learnkube.com/etcd-kubernetes" rel="noopener noreferrer"&gt;learnkube.com/etcd-kubernetes&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you've ever interacted with a Kubernetes cluster in any way, chances are it was powered by etcd under the hood.&lt;/p&gt;

&lt;p&gt;But even though etcd is at the heart of how Kubernetes works, it's rare to interact with it directly on a day-to-day basis.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;In this article, you will explore how it works!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecturally speaking, the Kubernetes API server is a CRUD application that stores manifests and serves data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hence, it needs a database to store its persisted data, which is where etcd fits into the picture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyir5f1nvvh5kwbqbq10.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyir5f1nvvh5kwbqbq10.png" alt="Kubernetes control plane" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;According to its website, etcd is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Strongly consistent.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Distributed.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Key-value store.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In addition, etcd has another feature that Kubernetes extensively uses: &lt;strong&gt;change notifications.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Etcd allows clients to subscribe to changes to a particular key or set of keys.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg29zg5m6fvjou6kepe26.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg29zg5m6fvjou6kepe26.png" alt="Key features of etcd" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Raft algorithm is the secret behind etcd's balance of strong consistency and high availability.&lt;/p&gt;

&lt;p&gt;Raft solves a particular problem: how can multiple processes decide on a single value for something?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Raft works by electing a leader and forcing all write requests to go to it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forudux4tkltp4j15dnpj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forudux4tkltp4j15dnpj.png" alt="The Raft algorithm" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How does the Leader get elected, though?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, all nodes start in the Follower state.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik1givwrql32em40zst4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fik1givwrql32em40zst4.png" alt="All nodes start in the follower state" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If followers don't hear from a leader, they can become candidates and request votes from other nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfz3zfh9777375et2y6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfz3zfh9777375et2y6d.png" alt="Followers can be become candidate" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nodes reply with their vote.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The candidate with the majority of the votes becomes the Leader.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Changes are then replicated from the Leader to all other nodes; if the Leader ever goes offline, a new election is held, and a new leader is chosen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxcg6u4sl0lla4gwphvs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flxcg6u4sl0lla4gwphvs.png" alt="A candidate becomes a leader" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What happens when you want to write a value in the database?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First, all write requests are redirected to the Leader.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Leader makes a note of the request but doesn't commit it to the log yet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwueebp4r4pq724mdccm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwueebp4r4pq724mdccm.png" alt="All requests are forwarded to the Leader" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Instead, the Leader replicates the value to the rest of the nodes (the followers).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9fmgl8vcdkdr0cm7udn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj9fmgl8vcdkdr0cm7udn.png" alt="The leader replicates the value to the followers" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Finally, the Leader waits until a majority of nodes have written the entry and commits the value.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At that point, the state of the database contains the value.&lt;/p&gt;

&lt;p&gt;Once the write succeeds, an acknowledgement is sent back to the client.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko3bf9vsw3equk4zitmn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fko3bf9vsw3equk4zitmn.png" alt="The value is written to disk" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A new election is held if the cluster leader goes offline for any reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In practice, this means that etcd will remain available as long as most nodes are online.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How many nodes should an etcd cluster have to achieve "good enough" availability?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It depends.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfsvy0yzd7mkzlidi2rr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfsvy0yzd7mkzlidi2rr.png" alt="RAFT HA table" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To help you answer that question, let me ask another question!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why stop at three etcd nodes? Why not have a cluster with 9, 21 or more nodes?&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hint: check out the replication part.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnn23c54jlqjhfb3vk83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvnn23c54jlqjhfb3vk83.png" alt="A cluster with 9 nodes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Leader has to wait for a quorum before the value is written to disk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The more followers there are in the cluster, the longer it takes to reach a consensus.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In other words, you trade write speed for availability.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fliaekb0cbeabbaa5qwjn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fliaekb0cbeabbaa5qwjn.png" alt="A cluster with 9 nodes takes more time to write values to disk" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you enjoyed this thread but want to know more about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Change notifications.&lt;/li&gt;
&lt;li&gt;Creating etcd clusters.&lt;/li&gt;
&lt;li&gt;Replacing etcd with SQL-like databases using kine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://learnkube.com/etcd-kubernetes" rel="noopener noreferrer"&gt;Check out this article.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnkube.com/training" rel="noopener noreferrer"&gt;https://learnkube.com/training&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This collection of past threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Kubernetes newsletter I publish every week &lt;a href="https://learnkube.com/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;https://learnkube.com/learn-kubernetes-weekly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Labels and annotations in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Mon, 24 Apr 2023 13:20:42 +0000</pubDate>
      <link>https://dev.to/danielepolencic/labels-and-annotations-in-kubernetes-j4i</link>
      <guid>https://dev.to/danielepolencic/labels-and-annotations-in-kubernetes-j4i</guid>
      <description>&lt;p&gt;In Kubernetes, you can use labels to assign key-value pairs to any resources.&lt;/p&gt;

&lt;p&gt;Labels are ubiquitous and necessary for everyday operations such as creating services.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;However, how should you name and use those labels?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Any resource in Kubernetes can have labels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Some labels are vital (e.g. service's selector, operators, etc.), and others are useful to tag resources (e.g. labelling a deployment).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl6hiq0n3j6maqgbkhmj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbl6hiq0n3j6maqgbkhmj.png" alt="Examples of useful and useless labels in Kubernetes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Kubectl offers a &lt;code&gt;--show-labels&lt;/code&gt; flag to help you list resources and their labels.&lt;/p&gt;

&lt;p&gt;If you list pods, deployments and services in an empty cluster, you might notice that Kubernetes uses the &lt;code&gt;component=&amp;lt;name&amp;gt;&lt;/code&gt; label to tag pods.&lt;/p&gt;
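
&lt;p&gt;The flag works with any resource type, and the &lt;code&gt;-L&lt;/code&gt; (short for &lt;code&gt;--label-columns&lt;/code&gt;) flag prints a single label as an extra column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List pods together with all of their labels
kubectl get pods --show-labels

# Print the value of a single label as an extra column
kubectl get pods -L component

# Works for deployments and services too
kubectl get deployments,services --show-labels
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;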

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdr89pndt0c94ixg8ov9l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdr89pndt0c94ixg8ov9l.png" alt="Retrieving labels from resources" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes recommends six labels for your resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Name&lt;/li&gt;
&lt;li&gt;Instance&lt;/li&gt;
&lt;li&gt;Version&lt;/li&gt;
&lt;li&gt;Component&lt;/li&gt;
&lt;li&gt;Part of&lt;/li&gt;
&lt;li&gt;Managed By&lt;/li&gt;
&lt;/ol&gt;
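
&lt;p&gt;In a manifest, those recommended labels live under the &lt;code&gt;app.kubernetes.io&lt;/code&gt; prefix; the values below are illustrative (they mirror the example in the official docs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;metadata:
  labels:
    app.kubernetes.io/name: mysql
    app.kubernetes.io/instance: mysql-abcxzy
    app.kubernetes.io/version: "5.7.21"
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: wordpress
    app.kubernetes.io/managed-by: helm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;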

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36u9kmkxynv4yo7gt6cs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36u9kmkxynv4yo7gt6cs.png" alt="Recommended labels for resources" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's look at an excellent example of using those labels: &lt;a href="https://github.com/prometheus-community/helm-charts" rel="noopener noreferrer"&gt;the Prometheus Helm chart.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chart installs five pods (i.e. server, alert manager, node exporter, push gateway and kube state metrics).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Notice how not all labels are applied to all pods.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9cull8tfcfeiv2cil7y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9cull8tfcfeiv2cil7y.png" alt="Prometheus Helm chart labels" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Labelling resources properly helps you make sense of what's deployed.&lt;/p&gt;

&lt;p&gt;For example, you can filter results with kubectl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-l&lt;/span&gt; &lt;span class="s2"&gt;"environment in (staging, dev)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The command above only lists pods in staging and dev.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fas12xt72dt1ku1dc58co.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fas12xt72dt1ku1dc58co.png" alt="Selecting resources that are tagged with labels" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If those labels are not what you are after, you can always create your own.&lt;/p&gt;

&lt;p&gt;A &lt;code&gt;&amp;lt;prefix&amp;gt;/&amp;lt;name&amp;gt;&lt;/code&gt; key is recommended — e.g. &lt;code&gt;company.com/database&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44vlwpel2nuvh1w2khu0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44vlwpel2nuvh1w2khu0.png" alt="Custom label with prefixes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following labels could be used in a multitenant cluster:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business unit&lt;/li&gt;
&lt;li&gt;Development team&lt;/li&gt;
&lt;li&gt;Application&lt;/li&gt;
&lt;li&gt;Client&lt;/li&gt;
&lt;li&gt;Shared services&lt;/li&gt;
&lt;li&gt;Environment&lt;/li&gt;
&lt;li&gt;Compliance&lt;/li&gt;
&lt;li&gt;Asset classification&lt;/li&gt;
&lt;/ul&gt;
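
&lt;p&gt;As a sketch, that might translate into labels like these — the prefix and values are hypothetical and should be adapted to your organisation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;metadata:
  labels:
    example.com/business-unit: commerce
    example.com/team: checkout
    example.com/environment: staging
    example.com/compliance: pci-dss
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;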

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkkqufnq77vwhq7muc1l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpkkqufnq77vwhq7muc1l.png" alt="Labels for multitenant cluster" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alongside labels, you have annotations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Whereas labels are used to select resources, annotations decorate resources with metadata.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You cannot select resources with annotations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbq1gxh2lnxdt219afbo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwbq1gxh2lnxdt219afbo.png" alt="Annotations in Kubernetes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Administrators can assign annotations to any workload.&lt;/p&gt;

&lt;p&gt;However, more often, &lt;strong&gt;Kubernetes and operators decorate resources with extra annotations.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A good example is the annotation &lt;code&gt;kubernetes.io/ingress-bandwidth&lt;/code&gt; to assign bandwidth to pods.&lt;/p&gt;
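
&lt;p&gt;For example, the pod below is capped at 1 megabit per second in each direction (support depends on your CNI — the bandwidth plugin must be enabled):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: throttled-pod
  annotations:
    kubernetes.io/ingress-bandwidth: 1M
    kubernetes.io/egress-bandwidth: 1M
spec:
  containers:
    - name: app
      image: nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;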

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9ie2xd3rsnsl02qqult.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9ie2xd3rsnsl02qqult.png" alt="Annotating resources with ingress and egress bandwidth constraints" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/reference/labels-annotations-taints/" rel="noopener noreferrer"&gt;The official documentation has a list of well-known labels and annotations.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are some examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;kubectl.kubernetes.io/default-container&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;topology.kubernetes.io/region&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;node.kubernetes.io/instance-type&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kubernetes.io/egress-bandwidth&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Annotations are used extensively in operators.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Look at &lt;a href="https://github.com/kubernetes/ingress-nginx/blob/main/docs/user-guide/nginx-configuration/annotations.md" rel="noopener noreferrer"&gt;all the annotations you can use with the ingress-nginx controller.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F212zvzep6tvnvcj1q3bx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F212zvzep6tvnvcj1q3bx.png" alt="There are several annotations available for the ingress-nginx controller" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, using operator- or cloud-provider-specific annotations is not always a good idea if you wish to stay vendor-neutral.&lt;/p&gt;

&lt;p&gt;However, sometimes it's also the only option &lt;a href="https://docs.aws.amazon.com/eks/latest/userguide/network-load-balancing.html" rel="noopener noreferrer"&gt;(e.g. having an AWS load balancer deployed in the correct subnet when using a Service of type LoadBalancer).&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx23nvae36nn225y6g8wp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx23nvae36nn225y6g8wp.png" alt="Annotations available on AWS and EKS" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here are a few links if you want to learn more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.kubecost.com/blog/kubernetes-labels/" rel="noopener noreferrer"&gt;The Guide to Kubernetes Labels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/reference/labels-annotations-taints/" rel="noopener noreferrer"&gt;Well-Known Labels, Annotations and Taints&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/" rel="noopener noreferrer"&gt;Recommended Labels&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.tigera.io/blog/label-standard-and-best-practices-for-kubernetes-security/" rel="noopener noreferrer"&gt;Label standard and best practices for Kubernetes security&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;The Kubernetes workshops that we run at Learnk8s.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;This collection of past threads.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;The Kubernetes newsletter I publish every week.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Autoscaling Ingress controllers in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Mon, 17 Apr 2023 12:29:34 +0000</pubDate>
      <link>https://dev.to/danielepolencic/autoscaling-ingress-controllers-in-kubernetes-1kgn</link>
      <guid>https://dev.to/danielepolencic/autoscaling-ingress-controllers-in-kubernetes-1kgn</guid>
      <description>&lt;p&gt;How do you deal with peaks of traffic in Kubernetes?&lt;/p&gt;

&lt;p&gt;To autoscale the Ingress controller based on incoming requests, you need the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Metrics&lt;/strong&gt; (e.g. the requests per second).&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;metrics collector&lt;/strong&gt; (to store the metrics).&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;autoscaler&lt;/strong&gt; (to act on the data).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs39xaiwzmh5ve5ax848c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs39xaiwzmh5ve5ax848c.png" alt="What you need to scale the ingress controller" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let's start with metrics.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kubernetes.github.io/ingress-nginx/user-guide/monitoring/" rel="noopener noreferrer"&gt;The nginx-ingress can be configured to expose Prometheus metrics.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can use &lt;code&gt;nginx_connections_active&lt;/code&gt; to count the number of active requests.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fui5x00aluqy7vr8z2zek.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fui5x00aluqy7vr8z2zek.png" alt="nginx_connections_active to count the number of active requests" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, you need a way to scrape the metrics.&lt;/p&gt;

&lt;p&gt;As you've already guessed, you can install Prometheus to do so.&lt;/p&gt;

&lt;p&gt;Since Nginx-ingress exposes its metrics via Prometheus scrape annotations (rather than the Operator's ServiceMonitors), I &lt;a href="https://github.com/prometheus-community/helm-charts" rel="noopener noreferrer"&gt;installed the plain Prometheus server without the Kubernetes operator.&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
&lt;span class="s2"&gt;"prometheus-community"&lt;/span&gt; has been added to your repositories
&lt;span class="nv"&gt;$ &lt;/span&gt;helm &lt;span class="nb"&gt;install &lt;/span&gt;prometheus prometheus-community/prometheus
NAME: prometheus
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I used Locust to generate some traffic to the Ingress to check that everything was running smoothly.&lt;/p&gt;

&lt;p&gt;With the Prometheus dashboard open, I checked that the metrics increased as more traffic hit the controller.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6my1dslikvwzslbfri29.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6my1dslikvwzslbfri29.gif" alt="Testing active connections" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The last piece of the puzzle was the autoscaler.&lt;/p&gt;

&lt;p&gt;I decided to go with &lt;a href="https://keda.sh" rel="noopener noreferrer"&gt;KEDA&lt;/a&gt; because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It's an autoscaler with a &lt;a href="https://github.com/kubernetes-sigs/metrics-server" rel="noopener noreferrer"&gt;metrics server&lt;/a&gt; (so I don't need to install two different tools).&lt;/li&gt;
&lt;li&gt;It's easier to configure than the Prometheus adapter.&lt;/li&gt;
&lt;li&gt;I can use the &lt;a href="https://keda.sh/docs/2.10/scalers/prometheus/" rel="noopener noreferrer"&gt;Horizontal Pod Autoscaler with PromQL.&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwva0bpqx0g8ygoc4qh6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwva0bpqx0g8ygoc4qh6.png" alt="How KEDA works" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once I installed KEDA, I only had to &lt;a href="https://keda.sh/docs/2.10/concepts/scaling-deployments/#scaledobject-spec" rel="noopener noreferrer"&gt;create a ScaledObject&lt;/a&gt;, configure the source of the metrics (Prometheus), and scale the Pods (with a PromQL query).&lt;/p&gt;
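
&lt;p&gt;A minimal sketch of such a ScaledObject — the deployment name, Prometheus address and threshold below are assumptions you'd adapt to your cluster:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ingress-scaler
spec:
  scaleTargetRef:
    name: ingress-nginx-controller   # deployment to scale (assumed name)
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.default.svc.cluster.local:80
        query: sum(nginx_connections_active)
        threshold: "100"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;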

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw1ryofbk64613mcgv6h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffw1ryofbk64613mcgv6h.png" alt="Example of a ScaledObject for Prometheus scaler in KEDA" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KEDA automatically creates the HPA for me.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I repeated the tests with Locust and watched the replicas increase as more traffic hit the Nginx Ingress controller!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohe03mnyuq4xybyegr6w.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fohe03mnyuq4xybyegr6w.gif" alt="Scaling the Ingress Controllers with KEDA" width="720" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Can this pattern be extended to any other app?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Can you autoscale all microservices on the number of requests received?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Unless they expose the metrics, the answer is no.&lt;/p&gt;

&lt;p&gt;However, there's a workaround.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kedacore/http-add-on" rel="noopener noreferrer"&gt;KEDA ships with an HTTP add-on to enable HTTP scaling.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How does it work!?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KEDA injects a sidecar proxy into your pod so that all HTTP traffic is routed through it first.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Then it measures the number of requests and exposes the metrics.&lt;/p&gt;

&lt;p&gt;With that data at hand, you can finally trigger the autoscaler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0b4uunk9as95dc5wwgcb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0b4uunk9as95dc5wwgcb.png" alt="Keda HTTP add-on architecture" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;KEDA is not the only option, though.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You could install the &lt;a href="https://github.com/kubernetes-sigs/prometheus-adapter" rel="noopener noreferrer"&gt;Prometheus Adapter.&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The metrics will flow from Nginx to Prometheus, and then the Adapter will make them available to Kubernetes.&lt;/p&gt;

&lt;p&gt;From there, they are consumed by the Horizontal Pod Autoscaler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flibi8p5k3o255wind6vp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flibi8p5k3o255wind6vp.png" alt="Prometheus adapter architecture" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Is this better than KEDA?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;They are similar, as both have to query and buffer metrics from Prometheus.&lt;/p&gt;

&lt;p&gt;However, KEDA is pluggable, and the Adapter works exclusively with Prometheus.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7jpqldim11durl9t7dh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7jpqldim11durl9t7dh.png" alt="Similarity between KEDA &amp;amp; the Prometheus Adapter" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Is there a competitor to KEDA?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A promising project called the &lt;a href="https://custom-pod-autoscaler.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;Custom Pod Autoscaler aims to make the pod autoscaler pluggable.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, the project focuses more on how those pods should be scaled (i.e. the scaling algorithm) than on metrics collection.&lt;/p&gt;

&lt;p&gt;During my research, I found these links helpful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://keda.sh/docs/2.10/scalers/prometheus/" rel="noopener noreferrer"&gt;https://keda.sh/docs/2.10/scalers/prometheus/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-hpa-prometheus/" rel="noopener noreferrer"&gt;https://sysdig.com/blog/kubernetes-hpa-prometheus/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/nginxinc/nginx-prometheus-exporter#exported-metrics" rel="noopener noreferrer"&gt;https://github.com/nginxinc/nginx-prometheus-exporter#exported-metrics&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/scaling-celery-rabbitmq-kubernetes" rel="noopener noreferrer"&gt;https://learnk8s.io/scaling-celery-rabbitmq-kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;The Kubernetes workshops that we run at Learnk8s.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;This collection of past threads.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;The Kubernetes newsletter I publish every week.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Multi-tenancy in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Mon, 10 Apr 2023 12:28:23 +0000</pubDate>
      <link>https://dev.to/danielepolencic/multi-tenancy-in-kubernetes-49d1</link>
      <guid>https://dev.to/danielepolencic/multi-tenancy-in-kubernetes-49d1</guid>
      <description>&lt;p&gt;&lt;em&gt;Should you have more than one team using the same Kubernetes cluster?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Can you run untrusted workloads safely from untrusted users?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Does Kubernetes do multi-tenancy?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This article will explore the challenges of running a cluster with multiple tenants.&lt;/p&gt;

&lt;p&gt;Multi-tenancy can be divided into:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Soft multi-tenancy&lt;/strong&gt; for when you trust your tenants — like when you share a cluster with teams from the same company.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard multi-tenancy&lt;/strong&gt; for when you don't trust tenants.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can also have a mix!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fooaqaghmumw1t3dume.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fooaqaghmumw1t3dume.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The basic building block to share a cluster between tenants is the namespace.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Namespaces group resources logically — they don't offer any security mechanisms nor guarantee that all resources are deployed in the same node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7rqt1q0kkmx7y6hmih5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn7rqt1q0kkmx7y6hmih5.png" alt="Namespaces are used to logically group resources" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pods in a namespace can still talk to all other pods in the cluster, make requests to the API, and use as many resources as they want.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Out of the box, any user can access any namespace.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How should you stop that?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dbez80io54m1vca5mnf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dbez80io54m1vca5mnf.png" alt="Namespaces don't offer any mechanism for isolation" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learnk8s.io/rbac-kubernetes" rel="noopener noreferrer"&gt;With RBAC, you can limit what users and apps can do with and within a namespace.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A common operation is to grant users limited permissions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rd77a1kqkrmnmkhmlqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4rd77a1kqkrmnmkhmlqj.png" alt="Limiting what users can do in a namespace" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
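&lt;p&gt;As a sketch, granting a user read-only access to pods in a tenant's namespace could look like this (the namespace, role, and user names are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: tenant-a
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: tenant-a
subjects:
  - kind: User
    name: jane
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
&lt;/code&gt;&lt;/pre&gt;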

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/concepts/policy/resource-quotas/" rel="noopener noreferrer"&gt;With Quotas&lt;/a&gt; and &lt;a href="https://kubernetes.io/docs/concepts/policy/limit-range/" rel="noopener noreferrer"&gt;LimitRanges&lt;/a&gt;, you can limit the resources deployed in the namespace and the memory, CPU, etc., that can be utilized.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is an excellent idea if you want to limit what a tenant can do with their namespace.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyei4in9tfyfv3hj39cu2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyei4in9tfyfv3hj39cu2.png" alt="LimitRange and ResourceQuota" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
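&lt;p&gt;For example, a ResourceQuota capping what a tenant can consume in their namespace might look like this (the namespace name and the numbers are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
&lt;/code&gt;&lt;/pre&gt;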

&lt;p&gt;&lt;strong&gt;By default, all pods can talk to any pod in Kubernetes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is not great for multi-tenancy, but you can correct this with &lt;a href="https://github.com/ahmetb/kubernetes-network-policy-recipes" rel="noopener noreferrer"&gt;NetworkPolicies.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Network policies are similar to firewall rules that let you segregate outbound and inbound traffic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2iy8cdu3b9t5y6q5kvn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2iy8cdu3b9t5y6q5kvn.png" alt="Network policies in Kubernetes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
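&lt;p&gt;As a sketch, the following policy denies ingress traffic from other namespaces while still letting pods within the same namespace talk to each other (the namespace name is hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-from-other-namespaces
  namespace: tenant-a
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}
&lt;/code&gt;&lt;/pre&gt;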

&lt;p&gt;&lt;em&gt;Great, is the namespace secure now?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not so fast.&lt;/p&gt;

&lt;p&gt;While RBAC, NetworkPolicies, Quotas, etc., give you the basic building blocks for multi-tenancy, they are not enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes has several shared components.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A good example is the Ingress controller, which is usually deployed once per cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If two tenants submit Ingress manifests with the same path, the last one overwrites the definition, and only one works.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwngp105jbkvboy5x4flk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwngp105jbkvboy5x4flk.png" alt="A single Ingress controller shared across namespaces" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's a better idea to deploy a controller per namespace.&lt;/p&gt;

&lt;p&gt;Another interesting challenge is CoreDNS.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What if one of the tenants abuses the DNS?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The rest of the cluster will suffer too.&lt;/p&gt;

&lt;p&gt;You could limit requests with an extra plugin &lt;a href="https://github.com/coredns/policy" rel="noopener noreferrer"&gt;https://github.com/coredns/policy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92o8xy1ag4jwbg1rz779.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F92o8xy1ag4jwbg1rz779.png" alt="Abusing of CoreDNS" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same challenge applies to the Kubernetes API server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes isn't aware of the tenant, and if the API receives too many requests, it will throttle them for everyone.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I don't know if there's a workaround for this!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3pmznmm8ccqc3206a3l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3pmznmm8ccqc3206a3l.png" alt="Abusing of the Kubernetes API" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Assuming you manage to sort out shared resources, there are also challenges with the kubelet and workloads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://xxradar.medium.com/exploiting-applications-using-livenessprobes-in-kubernetes-cdff6329d320" rel="noopener noreferrer"&gt;As Philippe Bogaerts explains in this article,&lt;/a&gt; a tenant could take over nodes in the cluster just (ab)using liveness probes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The fix is not trivial.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpct405xfa1tdwrlswbe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpct405xfa1tdwrlswbe.png" alt="Exploiting Kubernetes with liveness probes" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You could have a linter as part of your CI/CD process or use admission controllers to verify that resources submitted to the cluster are safe.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/open-policy-agent/gatekeeper-library" rel="noopener noreferrer"&gt;Here is a library or rules for the Open Policy Agent.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmu4k2ari8g4n1wq30s6d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmu4k2ari8g4n1wq30s6d.png" alt="Using admission controllers to lint and validate resources" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You also have containers that offer a weaker isolation mechanism than virtual machines.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=JaMJJTb_bEE" rel="noopener noreferrer"&gt;Lewis Denham-Parry shows how to escape from a container in this video.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How can you fix this?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You could use a container sandbox like &lt;a href="https://gvisor.dev/" rel="noopener noreferrer"&gt;gVisor&lt;/a&gt;, light virtual machines as containers (&lt;a href="https://katacontainers.io/" rel="noopener noreferrer"&gt;Kata containers&lt;/a&gt;, &lt;a href="https://github.com/firecracker-microvm/firecracker-containerd" rel="noopener noreferrer"&gt;firecracker + containerd&lt;/a&gt;) or full virtual machines (&lt;a href="https://github.com/Mirantis/virtlet" rel="noopener noreferrer"&gt;virtlet as a CRI&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggp48mngfp9l6fj3t711.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fggp48mngfp9l6fj3t711.png" alt="gVisor, Kata containers, Firecracker + containerd, virtlet" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hopefully, you've realized the complexity of the subject and how it's hard to provide rigid boundaries to separate networks, workloads, and controllers in Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://blog.jessfraz.com/post/hard-multi-tenancy-in-kubernetes/" rel="noopener noreferrer"&gt;That's why providing hard multi-tenancy in Kubernetes is not recommended.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you need hard multi-tenancy, the advice is to use multiple clusters or a Cluster-as-a-Service tool instead.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes-sigs/cluster-api" rel="noopener noreferrer"&gt;Cluster API&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/openshift/hypershift" rel="noopener noreferrer"&gt;HyperShift&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/clastix/kamaji" rel="noopener noreferrer"&gt;Kamaji&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gardener.cloud/" rel="noopener noreferrer"&gt;Gardener&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you can tolerate the weaker multi-tenancy model in exchange for simplicity and convenience, you can roll out your own RBAC, Quotas, etc.&lt;/p&gt;

&lt;p&gt;But there are a few tools that abstract those problems away for you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/kubernetes-sigs/cluster-api-provider-nested/tree/main/virtualcluster" rel="noopener noreferrer"&gt;Virtual Cluster (wg-multitenancy)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.vcluster.com/" rel="noopener noreferrer"&gt;Vcluster&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/clastix/capsule" rel="noopener noreferrer"&gt;Capsule&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/loft-sh/kiosk" rel="noopener noreferrer"&gt;Kiosk&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;https://learnk8s.io/training&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This collection of past threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Kubernetes newsletter I publish every week &lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;https://learnk8s.io/learn-kubernetes-weekly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Pod rebalancing and allocations in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Mon, 03 Apr 2023 12:43:38 +0000</pubDate>
      <link>https://dev.to/danielepolencic/pod-rebalancing-and-allocations-in-kubernetes-4b9n</link>
      <guid>https://dev.to/danielepolencic/pod-rebalancing-and-allocations-in-kubernetes-4b9n</guid>
      <description>&lt;p&gt;&lt;strong&gt;Does Kubernetes rebalance your Pods?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If there's a node that has more space, does Kubernetes recompute and balance the workloads?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let's have a look at an example.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You have a cluster with a single node that can host 2 Pods.&lt;/p&gt;

&lt;p&gt;If the node crashes, you will experience downtime.&lt;/p&gt;

&lt;p&gt;To prevent this, you could add a second node and run one Pod on each.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyickd24r3k1o3f3vraz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsyickd24r3k1o3f3vraz.png" alt="A Kubernetes cluster with a single node" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You provision a second node.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What happens next?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Does Kubernetes notice that there's a space for your Pod?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Does it move the second Pod and rebalance the cluster?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5enh062bcnox8y2ca49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy5enh062bcnox8y2ca49.png" alt="Does Kubernetes move the pods to the lower utilized node?" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unfortunately, it does not.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But why?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When you define a Deployment, you specify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The template for the Pod.&lt;/li&gt;
&lt;li&gt;The number of copies (replicas).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwkxn4k20a20ppwnci3e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvwkxn4k20a20ppwnci3e.png" alt="A Kubernetes deployment" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But nowhere in that file did you say that you want one replica on each node!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ReplicaSet counts 2 Pods, and that matches the desired state.&lt;/p&gt;

&lt;p&gt;Kubernetes won't take any further action.&lt;/p&gt;

&lt;p&gt;In other words, Kubernetes &lt;strong&gt;does not&lt;/strong&gt; rebalance your pods automatically.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But you can fix this.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are three popular options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pod (anti-)affinity.&lt;/li&gt;
&lt;li&gt;Pod topology spread constraints.&lt;/li&gt;
&lt;li&gt;The Descheduler.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The first option is to &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/" rel="noopener noreferrer"&gt;use pod anti-affinity.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With pod anti-affinity, your Pods repel other pods with the same label, forcing them to be on different nodes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/" rel="noopener noreferrer"&gt;You can read more about pod anti-affinity here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7p9l3w0lj6408h85frv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7p9l3w0lj6408h85frv.png" alt="Example of a pod anti-affinity" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;
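&lt;p&gt;As a sketch, adding the following to a Deployment's pod spec forces pods carrying the same label onto different nodes (the &lt;code&gt;app: myapp&lt;/code&gt; label is hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: myapp
        topologyKey: kubernetes.io/hostname
&lt;/code&gt;&lt;/pre&gt;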

&lt;p&gt;&lt;strong&gt;Notice how pod anti-affinity is evaluated only when the scheduler allocates the pods.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is not applied retroactively, so you might need to delete a few pods to force the scheduler to recompute the allocations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0ulg6fdwlkqpg437r9f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0ulg6fdwlkqpg437r9f.png" alt="Kubernetes does not rebalance two pods that have pod anti-affinity and are already allocated to the same node" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alternatively, you can use &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/" rel="noopener noreferrer"&gt;topology spread constraints&lt;/a&gt; to control how Pods are spread across your cluster among failure domains such as regions, zones, nodes, etc.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is similar to pod affinity but more powerful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ewebdgqalgikgreyg2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ewebdgqalgikgreyg2g.png" alt="Spreading pods across failure domains" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With topology spread constraints, you can pick the topology and choose the pod distribution (skew), what happens when the constraint is unfulfillable (schedule anyway vs don't) and the interaction with pod affinity and taints.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpakievs3priy1nn2o1c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzpakievs3priy1nn2o1c.png" alt="Example of pod topology spread constraints" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
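&lt;p&gt;A minimal sketch of a constraint that spreads pods evenly across nodes (the &lt;code&gt;app: myapp&lt;/code&gt; label is hypothetical); it goes in the pod spec:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: myapp
&lt;/code&gt;&lt;/pre&gt;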

&lt;p&gt;However, even in this case, the scheduler evaluates topology spread constraints when the pod is allocated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They are not applied retroactively, so you may need to delete the pods to force the scheduler to reallocate them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4aa6za60dufb82grw8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs4aa6za60dufb82grw8g.png" alt="Kubernetes does not rebalance two pods that have pod topology spread constraints and are already allocated to the same node" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you want to rebalance your pods dynamically (not just when the scheduler allocates them), you should check out &lt;a href="https://github.com/kubernetes-sigs/descheduler" rel="noopener noreferrer"&gt;the Descheduler.&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Descheduler scans your cluster at regular intervals, and if it finds a node that is more utilized than others, it deletes a pod in that node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo09pllo1j63lxue7nnxh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo09pllo1j63lxue7nnxh.png" alt="The Descheduler deletes pods" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What happens when a Pod is deleted?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The ReplicaSet will create a new Pod, and the scheduler will likely place it in a less utilized node.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your pod has topology spread constraints or pod affinity, it will be allocated accordingly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdlag22ylfnxgh8bihoh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwdlag22ylfnxgh8bihoh.png" alt="The Kubernetes scheduler will allocate pods efficiently" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Descheduler can evict pods based on policies such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node utilization.&lt;/li&gt;
&lt;li&gt;Pod age.&lt;/li&gt;
&lt;li&gt;Failed pods.&lt;/li&gt;
&lt;li&gt;Duplicates.&lt;/li&gt;
&lt;li&gt;Affinity or taints violations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3k49a894pjy5am0c7jq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3k49a894pjy5am0c7jq.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If your cluster has been running for a long time, the resource utilization could be unbalanced.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The following two strategies can be used to rebalance your cluster based on CPU, memory or number of pods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forifo4kbu8d7jhc3d0l9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Forifo4kbu8d7jhc3d0l9.png" alt="Descheduler configuration to rebalance under- and overutilized nodes" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;
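&lt;p&gt;As a sketch, a &lt;code&gt;LowNodeUtilization&lt;/code&gt; policy might look like this (the percentage thresholds are hypothetical; nodes below &lt;code&gt;thresholds&lt;/code&gt; are considered underutilized, and nodes above &lt;code&gt;targetThresholds&lt;/code&gt; are candidates for eviction):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          cpu: 20
          memory: 20
          pods: 20
        targetThresholds:
          cpu: 50
          memory: 50
          pods: 50
&lt;/code&gt;&lt;/pre&gt;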

&lt;p&gt;Another practical policy is preventing developers and operators from treating pods like virtual machines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can use the descheduler to ensure pods only run for a fixed time (e.g. seven days).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni6dsp1xaqzax739jzad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni6dsp1xaqzax739jzad.png" alt="Deleting pods after 7 days of utilization" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;
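&lt;p&gt;A sketch of a &lt;code&gt;PodLifeTime&lt;/code&gt; policy that evicts pods older than seven days:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 604800 # 7 days
&lt;/code&gt;&lt;/pre&gt;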

&lt;p&gt;&lt;strong&gt;Lastly, you can combine the Descheduler with Node Problem Detector and Cluster Autoscaler to automatically remove Nodes with problems.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kubernetes-sigs/descheduler/blob/master/docs/user-guide.md#autoheal-node-problems" rel="noopener noreferrer"&gt;The Descheduler can be used to descheduler workloads from those Nodes.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Descheduler is an excellent choice to keep your cluster efficiency in check, but it isn't installed by default.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kubernetes-sigs/descheduler" rel="noopener noreferrer"&gt;It can be deployed as a Job, CronJob or Deployment.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;https://learnk8s.io/training&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This collection of past threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Kubernetes newsletter I publish every week &lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;https://learnk8s.io/learn-kubernetes-weekly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Memory requests and limits in Kubernetes</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Mon, 27 Mar 2023 12:23:51 +0000</pubDate>
      <link>https://dev.to/danielepolencic/memory-requests-and-limits-in-kubernetes-40ep</link>
      <guid>https://dev.to/danielepolencic/memory-requests-and-limits-in-kubernetes-40ep</guid>
      <description>&lt;p&gt;In Kubernetes, what should I use as memory requests and limits?&lt;/p&gt;

&lt;p&gt;And what happens when you don't set them?&lt;/p&gt;

&lt;p&gt;Let's dive into it.&lt;/p&gt;

&lt;p&gt;In Kubernetes, you have two ways to specify how much memory a pod can use:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"Requests" are usually used to determine the average consumption.&lt;/li&gt;
&lt;li&gt;"Limits" set the max number of resources allowed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The Kubernetes scheduler uses requests to determine where the pod should be allocated in the cluster.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since the scheduler doesn't know the consumption (the pod hasn't started yet), it needs a hint.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzls66ti1zi1wr0y3bls3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzls66ti1zi1wr0y3bls3.png" alt="The Kubernetes scheduler works best with requests" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The kubelet uses limits to stop the process when it uses more memory than is allowed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It's worth noting that the process could spike in memory usage before it's terminated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmm8nnzk2lh73fob9j5r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhmm8nnzk2lh73fob9j5r.png" alt="The kubelet terminates the container when it goes over the memory limit" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The kubelet is also in charge of monitoring the total memory utilization of the node.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If memory is running low, the kubelet evicts low-priority pods.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But how does it decide what's low priority?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gq8i10qtcctjgiv5hd0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8gq8i10qtcctjgiv5hd0.png" alt="The kubelet evict pods if the node is running low on resources" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When Kubernetes creates a Pod, it assigns one of these QoS classes to the Pod:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Guaranteed&lt;/li&gt;
&lt;li&gt;Burstable&lt;/li&gt;
&lt;li&gt;BestEffort&lt;/li&gt;
&lt;/ol&gt;
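
&lt;p&gt;You can inspect which class Kubernetes assigned to a running pod (the pod name here is illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pod my-pod -o jsonpath='{.status.qosClass}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;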

&lt;p&gt;&lt;strong&gt;Pods that are "Guaranteed" have CPU and memory requests and limits and are least likely to face eviction.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For this class, the memory request must equal the memory limit, and the CPU request must equal the CPU limit.&lt;/p&gt;

&lt;p&gt;This class is best suited for stateful applications like databases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xtyh1poz299hfvp682n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xtyh1poz299hfvp682n.png" alt="Guaranteed Quality of Service for Pods" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pods with a "Burstable" class have memory and CPU requests but not limits.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This allows the Pods to flexibly increase their resource usage when spare capacity is available (but, without limits, they can also consume any amount of resources on the node).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknu85ds6wetr0ekv679x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fknu85ds6wetr0ekv679x.png" alt="burstable Quality of Service for Pods" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A Pod is "BestEffort" only if none of its containers has a memory or CPU limit or request.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Those Pods are the first to be evicted in the event of Node resource pressure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6jbn14rp7nk46avqpke.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6jbn14rp7nk46avqpke.png" alt="Burstable Quality of Service for Pods" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most of your pods are likely to be "Burstable" (i.e. requests, but fewer limits), and only a select few should be "Guaranteed".&lt;/p&gt;

&lt;p&gt;Burstable pods are good because they use resources dynamically and are cheaper.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2utuxz2ttz1hcupd3vo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2utuxz2ttz1hcupd3vo.png" alt="With Burstable pods you can dymically allocate resources as the container needs them" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Guaranteed pods, you allocate all resources up to the limit upfront, which could result in more expensive (but safer) deployments.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8m2s7ks50t56f7yg16z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8m2s7ks50t56f7yg16z.png" alt="With Guaranteed pods, resources are allocated upfront and can't be freed even if the process isn't using them" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;BestEffort pods are generally something you should avoid.&lt;/p&gt;

&lt;p&gt;The Kubernetes scheduler doesn't know how much memory or CPU the process needs, so it could end up scheduling an impractical amount of pods in the existing nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2vgkdcp1uuiaelcq0wo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo2vgkdcp1uuiaelcq0wo.png" alt="You can fit as many BestEffort pods in a node as you wish" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But if you stick only to Burstable pods, how does the kubelet know which pod to evict first?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pods can have PriorityClass that indicates the importance of a Pod relative to other Pods.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic7bydsdvpj9or3smn19.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic7bydsdvpj9or3smn19.png" alt="Pod PriorityClass" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scheduler also leverages the Pod PriorityClass to preempt lower-priority pods when the cluster is full.&lt;/p&gt;

&lt;p&gt;For example, if you have batch jobs that can tolerate interruptions (e.g. reports), you can assign them a low priority so they are evicted first.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdvr8o1i3kj4jpiz1tqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmdvr8o1i3kj4jpiz1tqn.png" alt="Pods with lower priority are evicted to make space for higher priority pods" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How should you choose the memory request for a pod?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A simple way is to calculate the smallest memory unit as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REQ = NODE_MEM / MAX_PODS_PER_NODE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a 4GB node and a limit of 10 Pods, that's a 400MB request.&lt;/p&gt;

&lt;p&gt;Assign the smallest unit or a multiplier to your containers.&lt;/p&gt;
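
&lt;p&gt;Continuing the example above (with illustrative values), a container assigned a single unit would request roughly 400Mi:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;resources:
  requests:
    memory: "400Mi"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;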

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpofkcll17k3bd0ko9fm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frpofkcll17k3bd0ko9fm.png" alt="Assigning requests for your pods" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A better approach is to monitor the app and derive the memory utilization.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can do this with your existing monitoring infrastructure or use the Vertical Pod Autoscaler to monitor and report the average request value.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqastyo6hdxpzlstxgzn.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsqastyo6hdxpzlstxgzn.gif" alt="Measuring memory consumption with the Vertical Pod Autoscaler" width="760" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How should I set the limits?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Exceeding the memory limit gets the container terminated, so you should definitely set a value lower than the memory available on the node.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://learnk8s.io/kubernetes-instance-calculator" rel="noopener noreferrer"&gt;Here's a handy calculator for that.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, if you want to dig in more, here are a few relevant links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/" rel="noopener noreferrer"&gt;Pod priorities&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/" rel="noopener noreferrer"&gt;Node pressure&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod" rel="noopener noreferrer"&gt;Pod Quality of Service&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;https://learnk8s.io/training&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This collection of past threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Kubernetes newsletter I publish every week &lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;https://learnk8s.io/learn-kubernetes-weekly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>IP and pod allocations in EKS</title>
      <dc:creator>Daniele Polencic</dc:creator>
      <pubDate>Tue, 21 Mar 2023 01:15:26 +0000</pubDate>
      <link>https://dev.to/danielepolencic/ip-and-pod-allocations-in-eks-5me</link>
      <guid>https://dev.to/danielepolencic/ip-and-pod-allocations-in-eks-5me</guid>
      <description>&lt;p&gt;When running an EKS cluster, you might face two issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running out of IP addresses assigned to pods.&lt;/li&gt;
&lt;li&gt;Low pod count per node (due to ENI limits).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, you will learn how to overcome both issues.&lt;/p&gt;

&lt;p&gt;Before we start, here is some background on how intra-node networking works in Kubernetes.&lt;/p&gt;

&lt;p&gt;When a node is created, the kubelet delegates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creating the container to the Container Runtime.&lt;/li&gt;
&lt;li&gt;Attaching the container to the network to the CNI.&lt;/li&gt;
&lt;li&gt;Mounting volumes to the CSI.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdenm4d0ykohtpblhgwnh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdenm4d0ykohtpblhgwnh.png" alt="The kubelet delegates tasks to the CRI, CNI and CSI" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let's focus on the CNI part.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each pod has its own isolated Linux network namespace and is attached to a bridge.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The CNI is responsible for creating the bridge, assigning the IP address and connecting veth0 to the cni0 bridge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2shsdg5nfdqvcqr8bs6j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2shsdg5nfdqvcqr8bs6j.png" alt="In most cases, all containers on a node are connected to a network bridge" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the most common setup, but different CNIs might use other means to connect the container to the network.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;As an example, there might not be a cni0 bridge.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The AWS-CNI is an example of such a CNI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34ibgvxzp3b5pgsdi794.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F34ibgvxzp3b5pgsdi794.png" alt="Not all CNI use a bridge to connect the containers on the same node" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In AWS, each EC2 instance can have multiple network interfaces (ENIs).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can assign a limited number of IPs to each ENI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For example, an &lt;code&gt;m5.large&lt;/code&gt; can have up to 10 IPs per ENI.&lt;/p&gt;

&lt;p&gt;Of those 10 IPs, one is reserved for the network interface itself.&lt;/p&gt;

&lt;p&gt;The rest can be assigned to pods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueu12l3nael0k9xsq319.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueu12l3nael0k9xsq319.png" alt="Elastic Network interfaces and IP addresses" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Previously, you could use the extra IPs and assign them to Pods.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But there was a big limit: the number of IP addresses.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let's have a look at an example.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;With an &lt;code&gt;m5.large&lt;/code&gt;, you have up to 3 ENIs with 10 private IP addresses each.&lt;/p&gt;

&lt;p&gt;Since one IP is reserved, you're left with 9 per ENI (or 27 in total).&lt;/p&gt;

&lt;p&gt;That means that your &lt;code&gt;m5.large&lt;/code&gt; could run up to 27 Pods.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Not a lot.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0l2kagl443gjppj7iy4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0l2kagl443gjppj7iy4.png" alt="You can have up to 27 pods in a m5.large" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But AWS released a change to EC2 that allows "prefixes" to be assigned to network interfaces.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Prefixes what?!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In simple words, ENIs now support a range instead of a single IP address.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If before you could have 10 private IP addresses, now you can have 10 slots of IP addresses.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;And how big is the slot?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;By default, 16 IP addresses (a /28 prefix).&lt;/p&gt;

&lt;p&gt;With 10 slots, you could have up to 160 IP addresses.&lt;/p&gt;

&lt;p&gt;That's a rather significant change!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Let's have a look at an example.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhzpztenvq8fz76iwefq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhzpztenvq8fz76iwefq.png" alt="Addresses prefix in EC2: before and after" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With an &lt;code&gt;m5.large&lt;/code&gt;, you have 3 ENIs with 10 slots (or IPs) each.&lt;/p&gt;

&lt;p&gt;Since one IP is reserved for the ENI, you're left with 9 slots.&lt;/p&gt;

&lt;p&gt;Each slot is 16 IPs, so &lt;code&gt;9*16=144&lt;/code&gt; IPs.&lt;/p&gt;

&lt;p&gt;Since there are 3 ENIs, &lt;code&gt;144x3=432&lt;/code&gt; IPs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can have up to 432 Pods now (vs 27 before).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld9jnkmrn7veayziwszs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld9jnkmrn7veayziwszs.png" alt="You can have up to 432 pods in a m5.large" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AWS-CNI supports slots but caps the max number of Pods at 110 or 250 (depending on the instance size), so you won't be able to run 432 Pods on an &lt;code&gt;m5.large&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It's also worth pointing out that this is not enabled by default — not even in newer clusters.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Perhaps because only nitro instances support it.&lt;/em&gt;&lt;/p&gt;
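
&lt;p&gt;If you want to opt in, prefix assignment is controlled by an environment variable on the aws-node DaemonSet:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;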

&lt;p&gt;Assigning slots is great until you realize that the CNI hands out 16 IP addresses at once instead of only 1, which has the following implications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Quicker IP space exhaustion.&lt;/li&gt;
&lt;li&gt;Fragmentation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Let's review those.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafyu6no7xtsz6vgkb040.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fafyu6no7xtsz6vgkb040.png" alt="Issue with prefixes in EC2 and EKS" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A pod is scheduled to a node.&lt;/p&gt;

&lt;p&gt;The AWS-CNI allocates 1 slot (16 IPs), and the pod uses one of them.&lt;/p&gt;

&lt;p&gt;Now imagine having 5 nodes and a deployment with 5 replicas.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What happens?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph76062iqikczay06wcu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph76062iqikczay06wcu.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Kubernetes scheduler prefers to spread the pods across the cluster.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Likely, each node receives 1 pod, and the AWS-CNI allocates 1 slot (16 IPs).&lt;/p&gt;

&lt;p&gt;You allocated &lt;code&gt;5*16=80&lt;/code&gt; IPs from your network, but only 5 are used.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9zjz4x9hwh22x8mjrcx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9zjz4x9hwh22x8mjrcx.png" alt="IP allocations with the AWS CNI" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But there's more.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Slots allocate a contiguous block of IP addresses.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a new slot is needed (e.g. when a node is created), the subnet must still have a contiguous block of 16 free addresses; in a fragmented subnet, such a block might not exist even if enough individual IPs are free.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vfzo36aqswvbelmqn2f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3vfzo36aqswvbelmqn2f.png" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How can you solve those?&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/eks-multiple-cidr-ranges/" rel="noopener noreferrer"&gt;You can assign a secondary CIDR to EKS.&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/subnet-cidr-reservation.html" rel="noopener noreferrer"&gt;You can reserve IP space within a subnet for exclusive use by slots.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Relevant links:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#AvailableIpPerENI" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html#AvailableIpPerENI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/containers/amazon-vpc-cni-increases-pods-per-node-limits/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-prefix-eni.html#ec2-prefix-basics" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-prefix-eni.html#ec2-prefix-basics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, if you've enjoyed this thread, you might also like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Kubernetes workshops that we run at Learnk8s &lt;a href="https://learnk8s.io/training" rel="noopener noreferrer"&gt;https://learnk8s.io/training&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;This collection of past threads &lt;a href="https://twitter.com/danielepolencic/status/1298543151901155330" rel="noopener noreferrer"&gt;https://twitter.com/danielepolencic/status/1298543151901155330&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;The Kubernetes newsletter I publish every week &lt;a href="https://learnk8s.io/learn-kubernetes-weekly" rel="noopener noreferrer"&gt;https://learnk8s.io/learn-kubernetes-weekly&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
