<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: waswani</title>
    <description>The latest articles on DEV Community by waswani (@waswani).</description>
    <link>https://dev.to/waswani</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F441729%2Fac55b199-5d4c-47d2-9ff8-9a22a91d9f5d.jpg</url>
      <title>DEV Community: waswani</title>
      <link>https://dev.to/waswani</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/waswani"/>
    <language>en</language>
    <item>
      <title>Hosting workload on the right node in Kubernetes</title>
      <dc:creator>waswani</dc:creator>
      <pubDate>Thu, 02 Feb 2023 18:37:59 +0000</pubDate>
      <link>https://dev.to/aws-builders/hosting-workload-on-the-right-node-in-kubernetes-5l1</link>
      <guid>https://dev.to/aws-builders/hosting-workload-on-the-right-node-in-kubernetes-5l1</guid>
<description>&lt;p&gt;Are you running Kubernetes as a platform on which Application teams run their services? If yes, then very soon you will start receiving typical requests from the teams, such as —&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Our service is compute heavy and we want to get the service pods hosted on a Compute heavy node and not a generic one.&lt;/li&gt;
&lt;li&gt;Pods of these two specific services should always be running on the same node.&lt;/li&gt;
&lt;li&gt;This is an AI/ML specific service and it should always be running on a GPU based node.&lt;/li&gt;
&lt;li&gt;My service can run on any node provided the node has SSD type storage.&lt;/li&gt;
&lt;li&gt;And this list literally can just go on, and on, and on.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Or, as a Platform team, you might collaborate with the Application teams to identify stateless workloads that can run on Spot instances, perhaps in lower environments if not in production. This helps you implement a cost-saving strategy from Day 0. It may sound like premature optimisation, but if properly planned, it can be done right in the first attempt.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Well, if you are in this situation, there is nothing to panic about, as Kubernetes has out-of-the-box features which can help you implement the above asks in a very clean way. These features are known as —&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Node selector or Node Affinity &amp;amp;&lt;/li&gt;
&lt;li&gt;Taints and Tolerations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s understand these two concepts and then we will jump straight into implementation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node selector&lt;/strong&gt; is a construct through which a Pod tells the Kubernetes Scheduler which node it should be hosted on, using the labels attached to the nodes. Example —&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
  nodeSelector:
    stack: frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above example, the Pod manifest specifies the node on which the pod should be deployed by effectively saying — schedule me on a node which has a label with key=stack and value=frontend.&lt;/p&gt;

&lt;p&gt;But Node Selector does not support more complex situations, for example where the desired node selection involves multiple labels, or where the node selection criteria is a soft or preferred one (&lt;em&gt;meaning, if no matching node is found, the scheduler will still host the pod on a node which does not match the defined conditions&lt;/em&gt;). Node Affinity was therefore introduced to handle such cases.&lt;/p&gt;

&lt;p&gt;For more details on Node affinity, please check this &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity"&gt;link&lt;/a&gt;.&lt;/p&gt;
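
&lt;p&gt;&lt;em&gt;For illustration, here is a minimal sketch of a Pod that uses preferred (soft) node affinity; the label key and value are the same hypothetical ones used in the nodeSelector example above:&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
    - name: nginx
      image: nginx
  affinity:
    nodeAffinity:
      # Soft rule - the scheduler prefers matching nodes but falls back to others
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1
          preference:
            matchExpressions:
              - key: stack
                operator: In
                values:
                  - frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;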

&lt;p&gt;&lt;strong&gt;Taints —&lt;/strong&gt; while Node Selector or Node Affinity is a property of a Pod which helps the Scheduler find the desired nodes, Taints are the opposite — they are a property of a Node and are configured to repel a set of pods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tolerations —&lt;/strong&gt; Tolerations are applied on Pods to nullify the taint effect, allowing the Kubernetes Scheduler to schedule those pods onto nodes carrying the matching taints.&lt;/p&gt;

&lt;p&gt;Taints and Tolerations work hand in hand. Let’s see an example —&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Below command helps to apply a Taint on node1 -

kubectl taint nodes node1 cloudforbeginners.com/ssd-storage=true:NoSchedule
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A taint follows a convention of Key=Value:Effect. In the above example, key of the taint is “cloudforbeginners.com/ssd-storage”, value of the taint is “true” and the effect is “NoSchedule”.&lt;/p&gt;

&lt;p&gt;Here is a Pod manifest which is configured to tolerate this taint —&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:
  - key: "cloudforbeginners.com/ssd-storage"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above example, we used the effect as NoSchedule. The other possible options for this attribute are — PreferNoSchedule and NoExecute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Effect helps Kubernetes Scheduler to decide -&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What should happen to an existing pod when a taint is applied to a node at runtime — should it continue to run on the node even though it does not tolerate the taint, or should it be evicted (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Whether a new pod should be scheduled onto a node, based on the Taints and Tolerations configured.&lt;/li&gt;
&lt;/ol&gt;
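
&lt;p&gt;&lt;em&gt;As a small, hypothetical sketch of the effect attribute in action: a NoExecute taint evicts pods that do not tolerate it, while a toleration with tolerationSeconds lets a pod stay on the tainted node only for a limited period:&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tolerations:
- key: "cloudforbeginners.com/compute-optimized"
  operator: "Equal"
  value: "true"
  effect: "NoExecute"
  # Without tolerationSeconds the pod stays bound indefinitely; with it, the pod
  # is evicted after the given number of seconds once the taint is present.
  tolerationSeconds: 300
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;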

&lt;p&gt;For further details on Taints and Tolerations, check this &lt;a href="https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/"&gt;link&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Now that we have understood the concepts of Node Selectors/Affinity and Taints and Tolerations, it’s time to see them in action.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;We will be using AWS EKS managed service for hosting our Kubernetes cluster and eksctl tool for creating the cluster.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We will create a Kubernetes cluster with 3 different categories of worker nodes, and each of these nodes will have taints and labels applied as per our business needs —&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Category A — Compute optimised nodes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Labels (workload-type=compute-optimized)&lt;/li&gt;
&lt;li&gt;Taints (cloudforbeginners.com/compute-optimized=true:NoExecute)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Category B— SSD optimised nodes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Labels (workload-type=ssd-optimized)&lt;/li&gt;
&lt;li&gt;Taints (cloudforbeginners.com/ssd-storage=true:NoSchedule)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Category C — General purpose nodes&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Labels (workload-type=generic)&lt;/li&gt;
&lt;li&gt;Taints (cloudforbeginners.com/generic=true:NoExecute)&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;In the AWS EKS service, a logical grouping of worker nodes with a specific configuration is managed via the Node Group concept. Behind the scenes, these Node Groups are managed using the AWS Auto Scaling service; one Auto Scaling group per Node Group.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
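
&lt;p&gt;&lt;em&gt;To give an idea of how a category maps to eksctl, one node group entry in the cluster config could look roughly like the sketch below (illustrative only; the eks-cluster.yaml in the repository used in Step 1 is the authoritative version):&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch of one managed node group with its label and taint (Category A)
managedNodeGroups:
  - name: compute-optimized-workload
    instanceType: t3.small
    desiredCapacity: 2
    labels:
      workload-type: compute-optimized
    taints:
      - key: cloudforbeginners.com/compute-optimized
        value: "true"
        effect: NoExecute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;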

&lt;p&gt;Once we launch our cluster, we will deploy multiple application services with different resource requirements and see how they leverage Kubernetes labels together with Taints and Tolerations to get themselves scheduled onto the desired nodes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let’s jump in —&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 0 —&lt;/strong&gt; Install the tools needed to create the Kubernetes infrastructure. The commands are tested with Linux OS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install eksctl tool
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

# Install kubectl tool
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.23.13/2022-10-31/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin

# Install or update AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1 —&lt;/strong&gt; Clone this Public repository from Github — &lt;a href="https://github.com/waswani/kubernetes-labels-taints"&gt;https://github.com/waswani/kubernetes-labels-taints&lt;/a&gt;, navigate to the folder kubernetes-labels-taints and create EKS cluster using the below command —&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create EKS Cluster with version 1.23
eksctl create cluster -f eks-cluster.yaml

# Output like below shows cluster has been successfully created
2023-01-31 19:19:19 [ℹ]  kubectl command should work with "/home/ec2-user/.kube/config", try 'kubectl get nodes'
2023-01-31 19:19:19 [✔]  EKS cluster "labels-taints-demo" in "us-west-2" region is ready
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the labels and taints applied to the Nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Get nodes 
kubectl get nodes --show-labels | grep workload-type=compute-optimized

#Output (scroll towards the end to see the label)
ip-192-168-14-93.us-west-2.compute.internal    Ready    &amp;lt;none&amp;gt;   15m   v1.23.13-eks-fb459a0   alpha.eksctl.io/cluster-name=labels-taints-demo,alpha.eksctl.io/nodegroup-name=compute-optimized-workload,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.small,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-0d453cab46e7202b2,eks.amazonaws.com/nodegroup=compute-optimized-workload,eks.amazonaws.com/sourceLaunchTemplateId=lt-095b33994789c7a16,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2a,k8s.io/cloud-provider-aws=3a1440dc044c748d3893b000ab850fc5,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-14-93.us-west-2.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.small,topology.kubernetes.io/region=us-west-2,topology.kubernetes.io/zone=us-west-2a,workload-type=compute-optimized
ip-192-168-53-19.us-west-2.compute.internal    Ready    &amp;lt;none&amp;gt;   15m   v1.23.13-eks-fb459a0   alpha.eksctl.io/cluster-name=labels-taints-demo,alpha.eksctl.io/nodegroup-name=compute-optimized-workload,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.small,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-0d453cab46e7202b2,eks.amazonaws.com/nodegroup=compute-optimized-workload,eks.amazonaws.com/sourceLaunchTemplateId=lt-095b33994789c7a16,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2b,k8s.io/cloud-provider-aws=3a1440dc044c748d3893b000ab850fc5,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-53-19.us-west-2.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.small,topology.kubernetes.io/region=us-west-2,topology.kubernetes.io/zone=us-west-2b,workload-type=compute-optimized

kubectl get nodes --show-labels | grep workload-type=generic

#Output (scroll towards the end to see the label)
ip-192-168-68-185.us-west-2.compute.internal   Ready    &amp;lt;none&amp;gt;   15m   v1.23.13-eks-fb459a0   alpha.eksctl.io/cluster-name=labels-taints-demo,alpha.eksctl.io/nodegroup-name=generic-workload,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.small,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-0d453cab46e7202b2,eks.amazonaws.com/nodegroup=generic-workload,eks.amazonaws.com/sourceLaunchTemplateId=lt-0afd5658b9ad2b18c,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2c,k8s.io/cloud-provider-aws=3a1440dc044c748d3893b000ab850fc5,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-68-185.us-west-2.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.small,topology.kubernetes.io/region=us-west-2,topology.kubernetes.io/zone=us-west-2c,workload-type=generic
ip-192-168-9-183.us-west-2.compute.internal    Ready    &amp;lt;none&amp;gt;   15m   v1.23.13-eks-fb459a0   alpha.eksctl.io/cluster-name=labels-taints-demo,alpha.eksctl.io/nodegroup-name=generic-workload,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.small,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-0d453cab46e7202b2,eks.amazonaws.com/nodegroup=generic-workload,eks.amazonaws.com/sourceLaunchTemplateId=lt-0afd5658b9ad2b18c,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2a,k8s.io/cloud-provider-aws=3a1440dc044c748d3893b000ab850fc5,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-9-183.us-west-2.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.small,topology.kubernetes.io/region=us-west-2,topology.kubernetes.io/zone=us-west-2a,workload-type=generic

kubectl get nodes --show-labels | grep workload-type=ssd-optimized

#Output (scroll towards the end to see the label)
ip-192-168-2-176.us-west-2.compute.internal    Ready    &amp;lt;none&amp;gt;   16m   v1.23.13-eks-fb459a0   alpha.eksctl.io/cluster-name=labels-taints-demo,alpha.eksctl.io/nodegroup-name=ssd-optimized-workload,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.small,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-0d453cab46e7202b2,eks.amazonaws.com/nodegroup=ssd-optimized-workload,eks.amazonaws.com/sourceLaunchTemplateId=lt-0b9ff8f227fef842c,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2a,k8s.io/cloud-provider-aws=3a1440dc044c748d3893b000ab850fc5,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-2-176.us-west-2.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.small,topology.kubernetes.io/region=us-west-2,topology.kubernetes.io/zone=us-west-2a,workload-type=ssd-optimized
ip-192-168-91-138.us-west-2.compute.internal   Ready    &amp;lt;none&amp;gt;   17m   v1.23.13-eks-fb459a0   alpha.eksctl.io/cluster-name=labels-taints-demo,alpha.eksctl.io/nodegroup-name=ssd-optimized-workload,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=t3.small,beta.kubernetes.io/os=linux,eks.amazonaws.com/capacityType=ON_DEMAND,eks.amazonaws.com/nodegroup-image=ami-0d453cab46e7202b2,eks.amazonaws.com/nodegroup=ssd-optimized-workload,eks.amazonaws.com/sourceLaunchTemplateId=lt-0b9ff8f227fef842c,eks.amazonaws.com/sourceLaunchTemplateVersion=1,failure-domain.beta.kubernetes.io/region=us-west-2,failure-domain.beta.kubernetes.io/zone=us-west-2c,k8s.io/cloud-provider-aws=3a1440dc044c748d3893b000ab850fc5,kubernetes.io/arch=amd64,kubernetes.io/hostname=ip-192-168-91-138.us-west-2.compute.internal,kubernetes.io/os=linux,node.kubernetes.io/instance-type=t3.small,topology.kubernetes.io/region=us-west-2,topology.kubernetes.io/zone=us-west-2c,workload-type=ssd-optimized

#To check the Taints on the nodes
for kube_node in $(kubectl get nodes | awk '{ print $1 }' | tail -n +2); do
 echo ${kube_node} $(kubectl describe node ${kube_node} | grep Taint);
done

#Output
ip-192-168-14-93.us-west-2.compute.internal Taints: cloudforbeginners.com/compute-optimized=true:NoExecute
ip-192-168-2-176.us-west-2.compute.internal Taints: cloudforbeginners.com/ssd-storage=true:NoSchedule
ip-192-168-53-19.us-west-2.compute.internal Taints: cloudforbeginners.com/compute-optimized=true:NoExecute
ip-192-168-68-185.us-west-2.compute.internal Taints: cloudforbeginners.com/generic=true:NoExecute
ip-192-168-9-183.us-west-2.compute.internal Taints: cloudforbeginners.com/generic=true:NoExecute
ip-192-168-91-138.us-west-2.compute.internal Taints: cloudforbeginners.com/ssd-storage=true:NoSchedule
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2 —&lt;/strong&gt; Let us now verify if the application pods are getting deployed to the desired node as expected.&lt;/p&gt;

&lt;p&gt;But before that, let’s see what happens if we just try to deploy a pod with no toleration configured.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create a nginx pod with no Toleration configured
kubectl run nginx --image=nginx

kubectl get pods
NAME    READY   STATUS    RESTARTS   AGE
nginx   0/1     Pending   0          18s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you check the Pod status, it shows as Pending. To see the reason, execute the below command —&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe pod nginx

Name:         nginx
Namespace:    default
Priority:     0
Node:         &amp;lt;none&amp;gt;
Labels:       run=nginx
Annotations:  kubernetes.io/psp: eks.privileged
Status:       Pending
IP:           
IPs:          &amp;lt;none&amp;gt;
Containers:
  nginx:
    Image:        nginx
    Port:         &amp;lt;none&amp;gt;
    Host Port:    &amp;lt;none&amp;gt;
    Environment:  &amp;lt;none&amp;gt;
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-jb4z4 (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-jb4z4:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       &amp;lt;nil&amp;gt;
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              &amp;lt;none&amp;gt;
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  29s   default-scheduler  0/6 nodes are available: 2 node(s) had taint {cloudforbeginners.com/compute-optimized: true}, that the pod didn't tolerate, 2 node(s) had taint {cloudforbeginners.com/generic: true}, that the pod didn't tolerate, 2 node(s) had taint {cloudforbeginners.com/ssd-storage: true}, that the pod didn't tolerate.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Because the pod did not have any toleration configured for the tainted nodes, the Kubernetes Scheduler could not find any target node to which this pod could be assigned.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s now deploy a pod which is configured to tolerate the taint “cloudforbeginners.com/compute-optimized=true:NoExecute” and targets nodes carrying the label “workload-type=compute-optimized”.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f nginx-with-compute-toleration-and-label.yaml 

#To get the status of the Pod
kubectl get pods -o wide

#Output
NAME                             READY   STATUS    RESTARTS   AGE   IP              NODE                                          NOMINATED NODE   READINESS GATES
nginx-with-toleration-no-label   1/1     Running   0          48s   192.168.19.30   ip-192-168-14-93.us-west-2.compute.internal   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the node ip-192-168-14-93.us-west-2.compute.internal is the one which is tainted with compute-optimized.&lt;/p&gt;
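
&lt;p&gt;&lt;em&gt;For reference, the nginx-with-compute-toleration-and-label.yaml manifest applied above would look roughly like the sketch below, combining the toleration with a node selection on the workload-type label (the file in the repository is the authoritative version):&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: nginx-with-toleration-no-label
spec:
  containers:
  - name: nginx
    image: nginx
  # Select the compute-optimized nodes via their label
  nodeSelector:
    workload-type: compute-optimized
  # Tolerate the NoExecute taint applied to those nodes
  tolerations:
  - key: "cloudforbeginners.com/compute-optimized"
    operator: "Equal"
    value: "true"
    effect: "NoExecute"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;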

&lt;p&gt;Similarly, we can create pods with ssd or generic workload node selector.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In a nutshell&lt;/strong&gt;, Node Selector or Node Affinity is for Pods to select Nodes, and Taints are for Nodes to allow or deny Pods from being hosted on them.&lt;/p&gt;

&lt;p&gt;Hope you enjoyed reading this blog. Do share this blog with your friends if this has helped you in any way.&lt;/p&gt;

&lt;p&gt;Happy Blogging…&lt;strong&gt;Cheers!!!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>productionreadykubernetes</category>
      <category>kubernetestaintsandtolerations</category>
      <category>kubernetesnodeaffinity</category>
      <category>kubernetesnodeselector</category>
    </item>
    <item>
      <title>Enforce Security and Governance in Kubernetes using OPA Gatekeeper</title>
      <dc:creator>waswani</dc:creator>
      <pubDate>Sat, 28 Jan 2023 08:32:26 +0000</pubDate>
      <link>https://dev.to/aws-builders/enforce-security-and-governance-in-kubernetes-using-opa-gatekeeper-384f</link>
      <guid>https://dev.to/aws-builders/enforce-security-and-governance-in-kubernetes-using-opa-gatekeeper-384f</guid>
<description>&lt;p&gt;As a Platform Team, when you host a Kubernetes Cluster for the rest of the application teams to run their services, there are a few things that you expect the teams to follow to make everyone’s life easier. For instance — you expect every team to define —&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CPU/Memory requirement for their application pods [&lt;em&gt;Governance Policy&lt;/em&gt;]&lt;/li&gt;
&lt;li&gt;The minimum set of labels as per Organization standard. Like — application name, cost center, etc. [&lt;em&gt;Governance Policy&lt;/em&gt;]&lt;/li&gt;
&lt;li&gt;Image repository should be from the approved list and not just any public repository [&lt;em&gt;Security Policy&lt;/em&gt;]&lt;/li&gt;
&lt;li&gt;In the Development environment, the replica count should always be set to 1 — perhaps to save on cost [&lt;em&gt;Governance Policy, to save on Cost&lt;/em&gt;]&lt;/li&gt;
&lt;li&gt;And the list goes on and on…&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It is a fair ask for teams to follow these best practices, but we all know from prior experience that until you enforce such rules, things will not fall into place. &lt;strong&gt;And this blog talks exactly about that&lt;/strong&gt; —&lt;/p&gt;

&lt;p&gt;"How to enforce Security and Governance policies to have fine grain control on the services running in a Kubernetes Cluster"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logically, here is what we would need to achieve it —&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step #1 —&lt;/strong&gt; A component which can understand the Governance and/or Security related policies and apply them as needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step #2 —&lt;/strong&gt; A way to define the policies we want to enforce&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step #3 —&lt;/strong&gt; Evaluation of these policies whenever a Kubernetes resource is created/edited/deleted — if all looks good, let the resource be created, else flag the error.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;BTW, if a resource is found not to follow the guidelines from the set of policies, one has the option either to fail the resource creation, OR to modify the resource manifest at admission time based on the policy and then let the resource be created.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Let's do a bit of deep dive into each of the steps -&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Step #1&lt;/strong&gt;, we will use Open Policy Agent (referred to as OPA from here on), an open source policy engine which helps enforce policies across the stack.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For more details on OPA, check &lt;a href="https://www.openpolicyagent.org/docs/latest/"&gt;this&lt;/a&gt; link —&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We will be using Gatekeeper to integrate OPA with Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thinking about Gatekeeper???&lt;/strong&gt; &lt;a href="https://open-policy-agent.github.io/gatekeeper"&gt;OPA Gatekeeper&lt;/a&gt; is a specialised project providing first-class integration between OPA and Kubernetes. &lt;em&gt;It leverages the OPA Constraint Framework to describe and enforce policy&lt;/em&gt;. The Constraint Framework works with two constructs: Constraint and ConstraintTemplate. Let’s understand what these mean by taking an example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Assume, we want to enforce a policy to have a minimum set of labels needed on a Kubernetes resource.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For this to work, &lt;strong&gt;we will first define a ConstraintTemplate&lt;/strong&gt;, which will describe —&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The constraint we want to enforce, written in Rego&lt;/li&gt;
&lt;li&gt;The schema of the constraint
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: requiredlabelsconstrainttemplate
  annotations:
    metadata.gatekeeper.sh/title: "Required Labels"
    metadata.gatekeeper.sh/version: 1.0.0
spec:
  crd:
    spec:
      names:
        kind: RequiredLabelsConstraintTemplate
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package requiredlabelsconstrainttemplate
        violation[{"msg": msg, "details": {"missing_labels": missing}}] {
          provided := {label | input.review.object.metadata.labels[label]}
          required := {label | label := input.parameters.labels[_]}
          missing := required - provided
          count(missing) &amp;gt; 0
          msg := sprintf("Kubernetes resource must have following labels: %v", [missing])
        }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;And then we define the actual Constraints&lt;/strong&gt;, which will be used by Gatekeeper to enforce the policy. And this Constraint is based on the ConstraintTemplate we have defined above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example —&lt;/strong&gt; Below Constraint enforces costCenter, serviceName and teamName labels on a Pod resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: constraints.gatekeeper.sh/v1beta1
kind: RequiredLabelsConstraintTemplate
metadata:
  name: pod-required-labels
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
  parameters:
    labels: ["teamName","serviceName","costCenter"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And below one enforces the “purpose” label on a Namespace resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: constraints.gatekeeper.sh/v1beta1
kind: RequiredLabelsConstraintTemplate
metadata:
  name: namespace-required-labels
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels: ["purpose"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Both the Constraints defined above leverage the same ConstraintTemplate and use the match field to help Gatekeeper identify which Constraint should be applied to which Kubernetes resource.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;For Step #2&lt;/strong&gt;, we will be using Rego, a purpose-built declarative policy language used by OPA.&lt;/p&gt;

&lt;p&gt;If we want to define a policy that ensures the expected labels are present on the Pods being created, it looks like the Rego embedded in the ConstraintTemplate shown earlier — the violation rule computes the set of required labels minus the provided ones and flags any that are missing.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For more details on Rego, check &lt;a href="https://www.openpolicyagent.org/docs/latest/policy-language/"&gt;this&lt;/a&gt; link —&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Step #3&lt;/strong&gt;, we will leverage the Admission Control construct of Kubernetes. An Admission Controller can be thought of as an interceptor that intercepts (authenticated) API requests and may change the request object or deny the request altogether. Dynamic admission controllers are implemented via the webhook mechanism.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Admission Controllers can be of two types, &lt;strong&gt;Mutating and Validating&lt;/strong&gt;. A Mutating Admission Controller can mutate the resource definition, while a Validating Admission Controller only validates the final resource definition before the object is persisted to etcd.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6T9oSDEL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/me1m37b98v5u34u8d9m5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6T9oSDEL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/me1m37b98v5u34u8d9m5.png" alt="Image Credit — https://kubernetes.io/blog/2019/03/21/a-guide-to-kubernetes-admission-controllers/" width="880" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;For more details on the Kubernetes Admission Controller, check &lt;a href="https://kubernetes.io/blog/2019/03/21/a-guide-to-kubernetes-admission-controllers/"&gt;this&lt;/a&gt; link —&lt;/em&gt;&lt;/p&gt;
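
&lt;p&gt;&lt;em&gt;As a quick way to see this wiring later in the demo cluster: once Gatekeeper is installed (Step #2 below), you can list the validating webhook configuration it registers; the exact name may vary with the Gatekeeper version:&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# List the validating webhook configurations registered in the cluster
kubectl get validatingwebhookconfigurations

# Gatekeeper typically registers one named something like
# gatekeeper-validating-webhook-configuration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;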

&lt;p&gt;&lt;strong&gt;So, our final flow would be something like this now —&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Platform Team installs OPA Gatekeeper in the Kubernetes Cluster.&lt;/li&gt;
&lt;li&gt;The Platform Team also creates Policies as Constraints and ConstraintTemplates and applies these in Kubernetes Cluster.&lt;/li&gt;
&lt;li&gt;Application Team triggers service deployment pipeline to deploy their service in Kubernetes.&lt;/li&gt;
&lt;li&gt;At the time of pod creation, Kubernetes Admission Controller intercepts and uses OPA Gatekeeper to validate the request.&lt;/li&gt;
&lt;li&gt;OPA Gatekeeper checks the list of Constraints deployed in the Cluster and if a match is found against the Kubernetes resource to be created, it performs the validation by evaluating the policy written in Rego inside the corresponding ConstraintTemplate.&lt;/li&gt;
&lt;li&gt;And based on the evaluation, Gatekeeper responds with an Allow or Deny response to the Kubernetes API Server.&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;strong&gt;Let’s see things in action —&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We will be using AWS EKS managed service for hosting our Kubernetes cluster and eksctl tool for creating the cluster.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step #0 —&lt;/strong&gt; Install the tools needed to create the Kubernetes infrastructure. The commands are tested with Linux OS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install eksctl tool
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

# Install kubectl tool
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.23.13/2022-10-31/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin

# Install or update AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step #1 —&lt;/strong&gt; Create the EKS cluster using the eksctl tool.&lt;/p&gt;

&lt;p&gt;Clone this Public repository from Github — &lt;a href="https://github.com/waswani/kubernetes-opa"&gt;https://github.com/waswani/kubernetes-opa&lt;/a&gt;, navigate to the folder kubernetes-opa and execute commands as mentioned below —&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create EKS Cluster with version 1.24
eksctl create cluster -f eks-cluster.yaml

# Output like below shows cluster has been successfully created
2023-01-24 18:25:50 [ℹ]  kubectl command should work with "/home/ec2-user/.kube/config", try 'kubectl get nodes'
2023-01-24 18:25:50 [✔]  EKS cluster "opa-demo" in "us-west-2" region is ready
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step #2&lt;/strong&gt; — Deploy OPA Gatekeeper&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f gatekeeper.yaml 

# Check the pods deployed in the gatekeeper-system namespace
kubectl get pods -n gatekeeper-system

NAME                                             READY   STATUS    RESTARTS   AGE
gatekeeper-audit-5cd987f4b7-b24qk                1/1     Running   0          10m
gatekeeper-controller-manager-856954594c-2bbsc   1/1     Running   0          10m
gatekeeper-controller-manager-856954594c-7qnft   1/1     Running   0          10m
gatekeeper-controller-manager-856954594c-8szl9   1/1     Running   0          10m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step #3 —&lt;/strong&gt; Define ConstraintTemplate and deploy it into Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;We will use the same template that we created earlier in the blog for enforcing labels.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f requiredLabelsConstraintTemplate.yaml

#To check installed templates
kubectl get constrainttemplate

#Output
NAME                               AGE
requiredlabelsconstrainttemplate   40s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step #4 —&lt;/strong&gt; Define Constraint and deploy it into Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;We will use the same constraints that we created earlier in the blog for enforcing labels on Pods and Namespace Kubernetes objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#For Pod
kubectl apply -f podRequiredLabelsConstraint.yaml 

#For Namespace
kubectl apply -f namespaceRequiredLabelsConstraint.yaml

#To check installed constraints
kubectl get constraints

#Output
NAME                        ENFORCEMENT-ACTION   TOTAL-VIOLATIONS
namespace-required-labels                        
pod-required-labels                              13
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
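
&lt;p&gt;&lt;em&gt;To see which existing resources are violating a constraint (Gatekeeper’s audit controller records them in the constraint’s status), you can describe the constraint by its kind and name, roughly like this:&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Inspect recorded violations for the Pod label constraint
kubectl describe requiredlabelsconstrainttemplate pod-required-labels
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;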



&lt;p&gt;&lt;strong&gt;Step #5 —&lt;/strong&gt; It’s time to test by creating Pods and Namespaces with and without required labels and see the outcome.&lt;/p&gt;

&lt;p&gt;Create Pod with no labels applied —&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;run nginx --image=nginx

#Output - Failed pod creation, as expected
Error from server (Forbidden): admission webhook "validation.gatekeeper.sh" denied the request: [pod-required-labels] Kubernetes resource must have following labels: {"costCenter", "serviceName", "teamName"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create Pod with required labels applied —&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl run nginx --image=nginx --labels=costCenter=Marketing,teamName=HR,serviceName=HRFrontend

#Output - Pod created successfully
pod/nginx created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create Namespace with no labels applied —&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create ns observability

#Output, creation failed
Error from server (Forbidden): admission webhook "validation.gatekeeper.sh" denied the request: [namespace-required-labels] Kubernetes resource must have following labels: {"purpose"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create Namespace with required labels applied —&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f test-ns.yaml 

#Output, NS created as it has the label "purpose" applied to it.
namespace/test created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
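
&lt;p&gt;&lt;em&gt;For reference, test-ns.yaml would be roughly a Namespace manifest carrying the required label, along the lines of the sketch below (the label value is illustrative; the constraint only checks that the key exists):&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Namespace
metadata:
  name: test
  labels:
    # Required by the namespace-required-labels constraint
    purpose: testing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;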



&lt;p&gt;&lt;strong&gt;To summarise&lt;/strong&gt;, we created a single ConstraintTemplate to enforce mandatory labels and, depending on the type of Kubernetes resource on which we wanted to enforce them, we created Constraint files and enforced the labels as per our needs. And all of this was possible with the help of OPA Gatekeeper.&lt;/p&gt;

&lt;p&gt;Hope you enjoyed reading this blog. Do share this blog with your friends if this has helped you in any way.&lt;/p&gt;

&lt;p&gt;Happy Blogging…&lt;strong&gt;Cheers!!!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>productionreadykubernetes</category>
      <category>kubernetesgovernance</category>
      <category>kubernetessecurit</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Autoscaling Nodes in Kubernetes</title>
      <dc:creator>waswani</dc:creator>
      <pubDate>Sun, 01 Jan 2023 05:46:04 +0000</pubDate>
      <link>https://dev.to/aws-builders/autoscaling-nodes-in-kubernetes-4ge5</link>
      <guid>https://dev.to/aws-builders/autoscaling-nodes-in-kubernetes-4ge5</guid>
<description>&lt;p&gt;Continuing with our journey of Horizontal Scaling in Kubernetes, here is another blog, focused on auto scaling of &lt;strong&gt;Nodes&lt;/strong&gt; in a Kubernetes cluster. &lt;strong&gt;&lt;em&gt;This capability is also referred to as the Cluster Autoscaler (CA).&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84tbjqpekf9nxvdu0pkh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F84tbjqpekf9nxvdu0pkh.jpg" alt="Image Credit - https://unsplash.com/photos/9cXMJHaViTM"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Please refer to this blog to get more context on how we managed Horizontal Pod Scaling in Kubernetes - &lt;a href="https://dev.to/aws-builders/autoscaling-pods-in-kubernetes-om2"&gt;Horizontal Pod Scaling in Kubernetes&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Though the concept is applicable across all major Cloud Providers, I will be using AWS Cloud provider with AWS EKS managed service as Kubernetes Platform for running the sample code.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;To set the context once again —&lt;/strong&gt; Assume that, as a Platform Team, you are running a Kubernetes Cluster in a cloud environment, and the Application teams hosting their services have asked you to ensure that their workload pods automatically scale out as traffic spikes. You make them aware of the Horizontal Pod Autoscaling construct and the teams implement it. Everyone is happy, but very soon you get into a situation where workload pods, while scaling out, get stuck in Pending state because the Kubernetes Cluster does not have enough resources available to schedule them.&lt;/p&gt;

&lt;p&gt;To handle this situation, you need a mechanism to automatically add more nodes to the Kubernetes Cluster, which helps ensure that pods don’t stay in Pending state. And similarly, if the pods are scaling in because of low traffic, the nodes should also be scaled in (&lt;em&gt;think terminated&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;Implementing this is conceptually straightforward, but honestly there are quite a few moving parts which need to be managed to get it done the right way.&lt;/p&gt;

&lt;p&gt;Let’s see what it would take to implement this with the AWS EKS service as our Kubernetes environment —&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logical Flow&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We need a component (let’s call this component the Cluster Autoscaler from here on) which can monitor whether Kubernetes Pods are going into Pending state because of a resource crunch. This component needs Kubernetes RBAC access to monitor the pods’ state.&lt;/li&gt;
&lt;li&gt;Once the Cluster Autoscaler realizes that pods are going into Pending state because of a resource shortage, it launches a new node (EC2 instance). This new node gets attached to the Kubernetes cluster. &lt;strong&gt;Note —&lt;/strong&gt; To launch a new node, this component needs AWS permissions to perform the action. We will talk more about this shortly. Hold on to your thoughts until then!!!&lt;/li&gt;
&lt;li&gt;Kubernetes pod Scheduler then starts scheduling the pods to the newly available node(s).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Implementation —&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We will be using AWS EKS managed service for hosting our Kubernetes cluster and eksctl tool for creating the cluster.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step 0 —&lt;/strong&gt; Install the tools needed to create the Kubernetes infrastructure. The commands are tested with Linux OS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install eksctl tool
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

# Install kubectl tool
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.23.13/2022-10-31/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin

# Install or update AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 1 —&lt;/strong&gt; Create EKS cluster using eksctl tool and deploy the Kubernetes Metrics service inside it.&lt;/p&gt;

&lt;p&gt;Clone this Public repository from Github — &lt;a href="https://github.com/waswani/kubernetes-ca" rel="noopener noreferrer"&gt;https://github.com/waswani/kubernetes-ca&lt;/a&gt;, navigate to the folder kubernetes-ca and execute the commands mentioned below —&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create EKS Cluster with version 1.23
eksctl create cluster -f eks-cluster.yaml

# Output like below shows cluster has been successfully created
2022-12-30 16:26:46 [ℹ]  kubectl command should work with "/home/ec2-user/.kube/config", try 'kubectl get nodes'
2022-12-30 16:26:46 [✔]  EKS cluster "ca-demo" in "us-west-2" region is ready

# Deploy the Metric server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Output of the above command looks something like below - 
serviceaccount/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
service/metrics-server created
deployment.apps/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2 —&lt;/strong&gt; Create IAM role which can be assumed by Cluster Autoscaler. It helps launch more nodes or delete existing nodes as traffic for the deployed workload goes up and down.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let’s get a bit into behind the scenes of this step —&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When you launch an AWS EKS cluster using the eksctl tool, an AWS Auto Scaling Group (referred to as ASG from here on) gets created for each node group configured in the configuration file. We have just one nodeGroup mentioned in the config file, but you can mention multiple. Are you wondering why one would need more than one nodeGroup??? My next blog is going to cover exactly this. Please wait until then.&lt;/li&gt;
&lt;li&gt;The ASG created has min, max and desired attributes configured as per the EKS configuration file.&lt;/li&gt;
&lt;li&gt;Cluster Autoscaler needs AWS privileges to change the desired attribute of AWS ASG to scale the nodes in or out.&lt;/li&gt;
&lt;li&gt;There are two possible ways by which we can assign AWS privileges to the Cluster Autoscaler component.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Option 1&lt;/strong&gt; — As our Cluster Autoscaler is going to run as a pod inside the EKS cluster, we can assign the required privileges to the IAM role which is assumed by the EKS cluster nodes. The Cluster Autoscaler can then implicitly use that IAM role to perform the job. It works perfectly, but it also means that any pod running inside the EKS cluster can use those privileges, which violates the principle of Least Privilege.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Option 2&lt;/strong&gt; — Assign just enough privileges directly to the Cluster Autoscaler pods in EKS. This concept is called IAM Roles for Service Account (referred as IRSA from here on).&lt;/li&gt;
&lt;li&gt;We will use the 2nd approach to make Cluster Autoscaler work. This approach leverages the concept of Kubernetes Service Account and maps this Service Account to the IAM role which has the AWS privileges to launch nodes. This Service Account is made available into the Pod via Pod resource configuration.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;When the Cluster Autoscaler needs to add or delete nodes, it tries to update the AWS ASG desired attribute by calling the AWS ASG API. The AWS SDK behind the scenes gets hold of temporary AWS credentials by calling &lt;a href="https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRoleWithWebIdentity.html" rel="noopener noreferrer"&gt;AssumeRoleWithWebIdentity&lt;/a&gt;, passing the JWT token which was issued to the Cluster Autoscaler pod during its creation.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For more details on IRSA, please check this blog — &lt;a href="https://waswani.medium.com/aws-eks-and-the-least-privilege-principle-a574a8028da6" rel="noopener noreferrer"&gt;AWS EKS and the Principle of Least Privilege&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here is the AWS policy needed for manipulating ASG —&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeTags",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeLaunchTemplateVersions"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the below code snippet, we are going to create an AWS IAM policy using the JSON mentioned above, plus an AWS IAM Role and a Kubernetes Service Account, using the eksctl create iamserviceaccount command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create IAM Policy
aws iam create-policy --policy-name ca-asg-policy --policy-document file://ca-asg-policy.json 

# Export the AWS Account ID to be used in the next command
export ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)

# Create Kubernetes Service Account and IAM Role account using eksctl tool
eksctl create iamserviceaccount --name cluster-autoscaler-sa \
--namespace kube-system --cluster ca-demo \
--attach-policy-arn "arn:aws:iam::${ACCOUNT_ID}:policy/ca-asg-policy" \
--approve --override-existing-serviceaccounts

# To verify the Service account got created successfully, execute below command
kubectl describe serviceaccount cluster-autoscaler-sa -n kube-system

# And the output looks something like this
Name:                cluster-autoscaler-sa
Namespace:           kube-system
Labels:              app.kubernetes.io/managed-by=eksctl
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::108756121336:role/eksctl-ca-demo-addon-iamserviceaccount-kube-Role1-22KGUNWA2E22
Image pull secrets:  &amp;lt;none&amp;gt;
Mountable secrets:   cluster-autoscaler-sa-token-gmmfz
Tokens:              cluster-autoscaler-sa-token-gmmfz
Events:              &amp;lt;none&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The output of the kubectl describe service account shows an important construct — Annotations. The mapping between the AWS IAM role and the service account is done using Annotations.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Step 3 —&lt;/strong&gt; Deploy the Cluster Autoscaler component which is going to perform the magic for us.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Deploy Cluster Autoscaler using below command
kubectl apply -f cluster-autoscaler-aws.yaml 

# The output shows all the Kubernetes native objects which gets 
# created as part of this deployment
clusterrole.rbac.authorization.k8s.io/cluster-autoscaler created
role.rbac.authorization.k8s.io/cluster-autoscaler created
clusterrolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
rolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
deployment.apps/cluster-autoscaler created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key points to remember —&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Cluster Autoscaler Docker image to be used is dependent on the EKS Kubernetes version. For us, it’s 1.23.0, hence we used — k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.0&lt;/li&gt;
&lt;li&gt;Service account created in step #2 is passed as serviceAccountName in the Cluster Autoscaler pod template.&lt;/li&gt;
&lt;li&gt;An ASG which gets created as part of EKS cluster Node Group is configured with 2 standard tags mentioned below.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k8s.io/cluster-autoscaler/ca-demo = owned
k8s.io/cluster-autoscaler/enabled = true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These tags help the Cluster Autoscaler discover the ASG(s) which can be used for scaling. An ASG which does not have these tags attached will not be discovered by the Cluster Autoscaler.&lt;/p&gt;
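
&lt;p&gt;&lt;em&gt;To tie these points together, the relevant portion of cluster-autoscaler-aws.yaml looks roughly like the sketch below (illustrative only; the manifest in the repository is the authoritative version):&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch of the Cluster Autoscaler Deployment (key fields only)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      # Service Account created in Step 2, annotated with the IAM role (IRSA)
      serviceAccountName: cluster-autoscaler-sa
      containers:
      - name: cluster-autoscaler
        # Image version matches the EKS Kubernetes version (1.23)
        image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.0
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        # Auto-discover ASGs carrying the two tags shown above
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/ca-demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;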

&lt;p&gt;&lt;strong&gt;Step 4 —&lt;/strong&gt; Let’s deploy the workload which will help to scale the cluster as pods start getting into pending state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Deploy the sample application
kubectl apply -f apache-php-deployment.yaml 

# Expose the deployed application via Service, listening on port 80
kubectl apply -f apache-php-service.yaml

# Create HPA
kubectl apply -f apache-php-hpa.yaml 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;We are using HPA concept from our previous blog — Horizontal Pod Scaling. Do take a look at that &lt;a href="https://waswani.medium.com/autoscaling-pods-in-kubernetes-37d05000c41" rel="noopener noreferrer"&gt;blog&lt;/a&gt; if you have any confusion around that.&lt;/p&gt;
&lt;/blockquote&gt;
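
&lt;p&gt;&lt;em&gt;For reference, apache-php-hpa.yaml would look roughly like the sketch below, targeting 20% average CPU with 1 to 10 replicas (matching the HPA output shown in Terminal 3 below); the file in the repository is the authoritative version:&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: apache-php
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: apache-php
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        # Scale out when average CPU utilisation across pods exceeds 20%
        averageUtilization: 20
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;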

&lt;p&gt;&lt;strong&gt;Before you execute the next step, open 3 different terminals and fire the below commands —&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal 1 —&lt;/strong&gt; Get current nodes running in the cluster&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes -w

# The initial output would show you 2 nodes
NAME                                           STATUS   ROLES    AGE   VERSION
ip-192-168-43-152.us-west-2.compute.internal   Ready    &amp;lt;none&amp;gt;   20h   v1.23.13-eks-fb459a0
ip-192-168-88-36.us-west-2.compute.internal    Ready    &amp;lt;none&amp;gt;   20h   v1.23.13-eks-fb459a0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Terminal 2 —&lt;/strong&gt; Get current pods running in the default namespace&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -w

# The output should show just a single pod running
NAME                          READY   STATUS    RESTARTS   AGE
apache-php-65f9f58f4f-zrnnf   1/1     Running   0          6m45s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Terminal 3 —&lt;/strong&gt; Get the Horizontal Pod Scale resource configured in the default namespace.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get hpa -w

# You should see something like this to start with
NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
apache-php   Deployment/apache-php   0%/20%    1         10        1          37m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 5 —&lt;/strong&gt; And now let’s increase the load on the service. This will first trigger horizontal Pod scaling, and then, as Pods go into Pending state, the Cluster Autoscaler comes into action, adds more nodes to the cluster, and the pending pods move to Running state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Launch a busybox pod and fire request to workload service in an infite loop
kubectl run -i --tty load-gen --image=busybox -- /bin/sh

# And from the shell command, fire below command
while true; do wget -q -O - http://apache-php; done

# And you should see output like this
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Here is the observation —&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;As load increases, the HPA launches more Pods to keep the average CPU utilisation at the 20% target.&lt;/li&gt;
&lt;li&gt;As the HPA tries to add more pods, the EKS cluster runs out of CPU resources and the new Pods go into Pending state.&lt;/li&gt;
&lt;li&gt;Pods going into Pending state triggers Cluster Autoscaler to add more Nodes. And once nodes join the EKS cluster and come to Ready state, Pods which were in Pending state now go into Running state.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Look at the output of each Terminal once again —&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminal 1 —&lt;/strong&gt; Additional node got launched as part of Cluster Autoscaling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get nodes -w

NAME                                           STATUS   ROLES    AGE   VERSION
ip-192-168-43-152.us-west-2.compute.internal   Ready    &amp;lt;none&amp;gt;   20h   v1.23.13-eks-fb459a0
ip-192-168-88-36.us-west-2.compute.internal    Ready    &amp;lt;none&amp;gt;   20h   v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady   &amp;lt;none&amp;gt;   0s    v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady   &amp;lt;none&amp;gt;   0s    v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady   &amp;lt;none&amp;gt;   0s    v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady   &amp;lt;none&amp;gt;   0s    v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady   &amp;lt;none&amp;gt;   2s    v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady   &amp;lt;none&amp;gt;   10s   v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady   &amp;lt;none&amp;gt;   18s   v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    Ready      &amp;lt;none&amp;gt;   31s   v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    Ready      &amp;lt;none&amp;gt;   31s   v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    Ready      &amp;lt;none&amp;gt;   32s   v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    Ready      &amp;lt;none&amp;gt;   62s   v1.23.13-eks-fb459a0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Terminal 2 —&lt;/strong&gt; Get status of pods in the cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -w

NAME                          READY   STATUS    RESTARTS   AGE
apache-php-65f9f58f4f-zrnnf   1/1     Running   0          8m47s
load-gen                      1/1     Running   0          13s
apache-php-65f9f58f4f-q8qsg   0/1     Pending   0          0s
apache-php-65f9f58f4f-q8qsg   0/1     Pending   0          0s
apache-php-65f9f58f4f-mz4w4   0/1     Pending   0          0s
apache-php-65f9f58f4f-6j7mg   0/1     Pending   0          0s
apache-php-65f9f58f4f-mz4w4   0/1     Pending   0          0s
apache-php-65f9f58f4f-6j7mg   0/1     Pending   0          0s
apache-php-65f9f58f4f-q8qsg   0/1     ContainerCreating   0          0s
apache-php-65f9f58f4f-mz4w4   0/1     ContainerCreating   0          0s
apache-php-65f9f58f4f-6j7mg   0/1     ContainerCreating   0          0s
apache-php-65f9f58f4f-mz4w4   1/1     Running             0          2s
apache-php-65f9f58f4f-6j7mg   1/1     Running             0          2s
apache-php-65f9f58f4f-q8qsg   1/1     Running             0          2s
apache-php-65f9f58f4f-h8z2p   0/1     Pending             0          0s
apache-php-65f9f58f4f-h8z2p   0/1     Pending             0          0s
apache-php-65f9f58f4f-g45qg   0/1     Pending             0          0s
apache-php-65f9f58f4f-t6b6b   0/1     Pending             0          0s
apache-php-65f9f58f4f-g45qg   0/1     Pending             0          0s
apache-php-65f9f58f4f-t6b6b   0/1     Pending             0          0s
apache-php-65f9f58f4f-ncll2   0/1     Pending             0          0s
apache-php-65f9f58f4f-ncll2   0/1     Pending             0          0s
apache-php-65f9f58f4f-h8z2p   0/1     ContainerCreating   0          0s
apache-php-65f9f58f4f-g45qg   0/1     ContainerCreating   0          0s
apache-php-65f9f58f4f-h8z2p   1/1     Running             0          2s
apache-php-65f9f58f4f-g45qg   1/1     Running             0          2s
apache-php-65f9f58f4f-hctnf   0/1     Pending             0          0s
apache-php-65f9f58f4f-hctnf   0/1     Pending             0          0s
apache-php-65f9f58f4f-ncll2   0/1     Pending             0          75s
apache-php-65f9f58f4f-t6b6b   0/1     Pending             0          75s
apache-php-65f9f58f4f-hctnf   0/1     Pending             0          30s
apache-php-65f9f58f4f-ncll2   0/1     ContainerCreating   0          75s
apache-php-65f9f58f4f-t6b6b   0/1     ContainerCreating   0          75s
apache-php-65f9f58f4f-hctnf   0/1     ContainerCreating   0          30s
apache-php-65f9f58f4f-ncll2   1/1     Running             0          91s
apache-php-65f9f58f4f-t6b6b   1/1     Running             0          91s
apache-php-65f9f58f4f-hctnf   1/1     Running             0          47s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Terminal 3 —&lt;/strong&gt; Additional pods were launched as part of Horizontal Pod scaling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get hpa -w

NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
apache-php   Deployment/apache-php   0%/20%    1         10        1          37m
apache-php   Deployment/apache-php   149%/20%   1         10        1          38m
apache-php   Deployment/apache-php   149%/20%   1         10        4          38m
apache-php   Deployment/apache-php   111%/20%   1         10        8          38m
apache-php   Deployment/apache-php   23%/20%    1         10        8          38m
apache-php   Deployment/apache-php   30%/20%    1         10        8          39m
apache-php   Deployment/apache-php   27%/20%    1         10        9          39m
apache-php   Deployment/apache-php   29%/20%    1         10        9          39m
apache-php   Deployment/apache-php   28%/20%    1         10        9          39m
apache-php   Deployment/apache-php   26%/20%    1         10        9          40m
apache-php   Deployment/apache-php   19%/20%    1         10        9          40m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Stop the busybox container that is bombarding the service and wait a few minutes; you will see the extra node(s) getting terminated.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ip-192-168-28-98.us-west-2.compute.internal    Ready      &amp;lt;none&amp;gt;   17m     v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   17m     v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    Ready,SchedulingDisabled   &amp;lt;none&amp;gt;   17m     v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady,SchedulingDisabled   &amp;lt;none&amp;gt;   18m     v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady,SchedulingDisabled   &amp;lt;none&amp;gt;   18m     v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady,SchedulingDisabled   &amp;lt;none&amp;gt;   18m     v1.23.13-eks-fb459a0
ip-192-168-88-36.us-west-2.compute.internal    Ready                         &amp;lt;none&amp;gt;   21h     v1.23.13-eks-fb459a0
ip-192-168-43-152.us-west-2.compute.internal   Ready                         &amp;lt;none&amp;gt;   21h     v1.23.13-eks-fb459a0
ip-192-168-88-36.us-west-2.compute.internal    Ready                         &amp;lt;none&amp;gt;   21h     v1.23.13-eks-fb459a0
ip-192-168-43-152.us-west-2.compute.internal   Ready                         &amp;lt;none&amp;gt;   21h     v1.23.13-eks-fb459a0
ip-192-168-88-36.us-west-2.compute.internal    Ready                         &amp;lt;none&amp;gt;   21h     v1.23.13-eks-fb459a0
ip-192-168-43-152.us-west-2.compute.internal   Ready                         &amp;lt;none&amp;gt;   21h     v1.23.13-eks-fb459a0
ip-192-168-88-36.us-west-2.compute.internal    Ready                         &amp;lt;none&amp;gt;   21h     v1.23.13-eks-fb459a0
ip-192-168-43-152.us-west-2.compute.internal   Ready                         &amp;lt;none&amp;gt;   21h     v1.23.13-eks-fb459a0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
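
&lt;p&gt;If you want to watch the scale-down decision as it happens, you can also tail the Cluster Autoscaler logs. This assumes the Cluster Autoscaler is installed as a Deployment named cluster-autoscaler in the kube-system namespace, which is the common EKS setup; adjust the name if your installation differs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Follow the Cluster Autoscaler logs and filter for scale-down activity
kubectl -n kube-system logs -f deployment/cluster-autoscaler | grep -i "scale.down"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;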



&lt;p&gt;&lt;strong&gt;Step 6 —&lt;/strong&gt; One of the most important steps: delete the Kubernetes cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Delete EKS Cluster 
eksctl delete cluster -f eks-cluster.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;With this blog, we covered Horizontal Scaling for Nodes in an EKS Cluster. The next blogs will cover —&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Scale the Pods based on Custom Metrics&lt;/li&gt;
&lt;li&gt;Get an alert on Slack if Pods go into Pending state&lt;/li&gt;
&lt;li&gt;Use Kubernetes to host services with different resource requirements. This will cover the point I mentioned above: why would you need more than one nodeGroup in the cluster configuration file?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hope you enjoyed reading this blog. Do share this blog with your friends if this has helped you in any way.&lt;/p&gt;

&lt;p&gt;Happy Blogging…&lt;strong&gt;Cheers!!!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>productionreadykubernetes</category>
      <category>eks</category>
      <category>kubernetes</category>
      <category>clusterautoscaler</category>
    </item>
    <item>
      <title>Autoscaling Pods in Kubernetes</title>
      <dc:creator>waswani</dc:creator>
      <pubDate>Sat, 31 Dec 2022 20:43:55 +0000</pubDate>
      <link>https://dev.to/aws-builders/autoscaling-pods-in-kubernetes-om2</link>
      <guid>https://dev.to/aws-builders/autoscaling-pods-in-kubernetes-om2</guid>
      <description>&lt;p&gt;If you are hosting your workload in a cloud environment, and your traffic pattern is fluctuating in nature (&lt;em&gt;think unpredictable&lt;/em&gt;), you need a mechanism to &lt;strong&gt;automatically scale out&lt;/strong&gt; (and off-course &lt;strong&gt;scale in&lt;/strong&gt;) your workload to ensure the service is able to perform as per defined Service Level Objective (&lt;strong&gt;SLO&lt;/strong&gt;), without impacting the User Experience. &lt;strong&gt;This semantic is referred to as Autoscaling, to be very precise Horizontal Scaling.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9tL41LEx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iz6urxnru1pjpvv3kvki.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9tL41LEx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iz6urxnru1pjpvv3kvki.jpg" alt="Image from — https://unsplash.com/s/photos/kubernetes" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Horizontal Scaling is the construct of adding/removing similar size (think replica) machines depending on demand/certain conditions. In the context of Kubernetes, it would mean — add more Pods or remove existing Pods.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Scaling can be of two types — Vertical and Horizontal. And this blog is focussed on Horizontal scaling.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;By the way, there can be other components in the system which might impact the performance and user experience, but the focus of this blog is Compute layer, realised with Kubernetes pods.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apart from ensuring that the service can handle the load, there are a couple of indirect benefits that come with Horizontal Scaling —&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cost&lt;/strong&gt; — If you are running in a Cloud environment, one of the key charters is to run the workload in a cost-effective manner. To address this, if we can dynamically run just enough workload pods to handle the traffic, then we pay only for what we need. If we don’t leverage the AutoScaling construct, then we need to over-provision the pods, and hence shell out more dollars.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Better Resource Utilisation —&lt;/strong&gt; If a Kubernetes workload is implemented with AutoScaling, it gives other workload pods a fair opportunity to scale out should there be a need. That would be hard to achieve if every workload always ran a fixed number of Pods irrespective of demand.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;Now that we understand what Horizontal Scaling is and why we need it, the next logical question is — &lt;strong&gt;how do we make it work in a Kubernetes environment?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Well, Kubernetes provides a native feature — HorizontalPodAutoscaler (referred to as HPA from here on) which can help to horizontally scale the workload pods. The important question to ask is — &lt;strong&gt;When to scale?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Generally, scaling of the application is done based on one of two metrics — CPU and Memory. HPA can be configured against these metrics to scale the pods, and you are done. &lt;strong&gt;Not really!!!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There is still a prerequisite to make it work — you need to make these metrics available for the HPA to consume.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Some of the other metrics which can be used to scale the pods are — number of incoming requests, number of outgoing requests, message broker queue depth, etc.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Lost? Let’s make it a bit clearer by walking through how it works logically within a Kubernetes cluster —&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Every Pod running in a Kubernetes environment consumes CPU and Memory.&lt;/li&gt;
&lt;li&gt;Make these metrics available as a centralised service for others to consume (referred to as the Metrics Server from here on).&lt;/li&gt;
&lt;li&gt;Configure HPA (a Kubernetes native resource object) for a specific workload using the metric, saying — if the CPU consumption reaches x%, scale the workload pods (see the quick example after this list).&lt;/li&gt;
&lt;li&gt;The HPA Kubernetes controller checks the Pod metrics with the Metrics Server at regular intervals and validates them against the HPA resource configured for the pod. If the scaling condition matches, the HPA controller updates the replica count of the workload’s Deployment resource (a Kubernetes native object).&lt;/li&gt;
&lt;li&gt;The Deployment controller then takes action depending on the replica count updated in the Deployment resource object.&lt;/li&gt;
&lt;li&gt;And the pod replicas either increase (scale out) or decrease (scale in).&lt;/li&gt;
&lt;/ol&gt;
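
&lt;p&gt;As a quick illustration of point 3 above, the same intent can also be expressed imperatively with kubectl. The deployment name my-app here is just a placeholder; the declarative YAML form is shown later in Step 4.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Scale the my-app Deployment between 1 and 10 replicas,
# targeting an average CPU utilization of 50%
kubectl autoscale deployment my-app --cpu-percent=50 --min=1 --max=10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;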

&lt;p&gt;Kind of makes sense, but how do I implement the Metrics Service? There are multiple options available to implement the Metrics Service. One of the easiest is to go with the Kubernetes Metrics Server.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let’s see things in action and build a HPA for a workload running in Kubernetes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;We will be using AWS EKS managed service for hosting our Kubernetes cluster and eksctl tool for creating the cluster.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 0 —&lt;/strong&gt; Install the tools needed to create the Kubernetes infrastructure. The commands are tested with Linux OS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install eksctl tool
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

# Install kubectl tool
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.23.13/2022-10-31/bin/linux/amd64/kubectl
# Make the binary executable and move it onto the PATH
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin

# Install or update AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
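
&lt;p&gt;Optionally, verify that the tools are installed and on the PATH before moving on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Verify the installed versions
eksctl version
kubectl version --client
aws --version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;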



&lt;p&gt;&lt;strong&gt;Step 1 —&lt;/strong&gt; Clone this Public repository from Github — &lt;a href="https://github.com/waswani/kubernetes-hpa"&gt;https://github.com/waswani/kubernetes-hpa&lt;/a&gt;, navigate to the folder kubernetes-hpa and create EKS cluster using the below command —&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create EKS Cluster with version 1.23
eksctl create cluster -f eks-cluster.yaml

# Output like below shows cluster has been successfully created

2022-12-29 08:42:22 [ℹ]  kubectl command should work with "/home/ec2-user/.kube/config", try 'kubectl get nodes'
2022-12-29 08:42:22 [✔]  EKS cluster "hpa-demo" in "us-west-2" region is ready
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;For demo purposes, you can attach the Administrator policy to the AWS IAM user or Role used for launching the EKS cluster.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It will take roughly 15–20 minutes for the cluster to come up. The data plane (worker node group) will have 3 nodes of size t3.small.&lt;/p&gt;
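
&lt;p&gt;The repository’s eks-cluster.yaml is not reproduced in this post, but an eksctl configuration matching what is described here (cluster name hpa-demo, region us-west-2, Kubernetes 1.23, three t3.small worker nodes) would look roughly like the sketch below. The node group name and layout are illustrative; refer to the repository for the exact file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative sketch of an eksctl cluster configuration
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: hpa-demo
  region: us-west-2
  version: "1.23"

nodeGroups:
  - name: worker-ng          # illustrative node group name
    instanceType: t3.small
    desiredCapacity: 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;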

&lt;p&gt;&lt;strong&gt;Step 2 —&lt;/strong&gt; Deploy the Metrics Server in the Kubernetes infrastructure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Deploy the Metric server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Output of the above command looks something like below - 

serviceaccount/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
service/metrics-server created
deployment.apps/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3 —&lt;/strong&gt; Deploy a sample workload in the default namespace. This workload is fronted by a Service which listens on port 80.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Deploy the sample application
kubectl apply -f apache-php-deployment.yaml 

# Expose the deployed application via Service, listening on port 80
kubectl apply -f apache-php-service.yaml

# Get pods running as part of the Deployment
kubectl get pods
NAME                          READY   STATUS    RESTARTS   AGE
apache-php-65498c4955-d2tbj   1/1     Running   0          8m55s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The Service is exposed with the host name apache-php inside the Kubernetes cluster. The replica count is configured as 1, and hence only a single pod is running.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;
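
&lt;p&gt;The repository’s apache-php-deployment.yaml and apache-php-service.yaml are not reproduced here, but the detail that matters for HPA is that the container declares a CPU request, because percentage-based CPU targets are computed against the requested value. A minimal sketch of such a Deployment and Service is shown below; the image and resource values are illustrative, not necessarily what the repository uses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative sketch - a Deployment with a CPU request and a Service named apache-php
apiVersion: apps/v1
kind: Deployment
metadata:
  name: apache-php
spec:
  replicas: 1
  selector:
    matchLabels:
      app: apache-php
  template:
    metadata:
      labels:
        app: apache-php
    spec:
      containers:
        - name: apache-php
          image: registry.k8s.io/hpa-example   # illustrative image serving HTTP on port 80
          ports:
            - containerPort: 80
          resources:
            requests:
              cpu: 200m
            limits:
              cpu: 500m
---
apiVersion: v1
kind: Service
metadata:
  name: apache-php
spec:
  selector:
    app: apache-php
  ports:
    - port: 80
      targetPort: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;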

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# To test if metrics server is responding well, execute below command and 
# you should see the current resource consumption of the pod which just got
# deployed using above deployment

kubectl top pod

# Output
NAME                          CPU(cores)   MEMORY(bytes)   
apache-php-65498c4955-d2tbj   1m           9Mi    
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4 —&lt;/strong&gt; Create HPA resource mentioning the condition for the auto scale to happen.&lt;/p&gt;

&lt;p&gt;We will configure HPA with a condition of keeping the average CPU utilization of the pods at 50%. If this value goes above 50%, HPA will trigger a scale out to add more pods. A sketch of such a manifest is shown below, followed by the command to apply it.&lt;br&gt;
&lt;/p&gt;
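
&lt;p&gt;The repository’s apache-php-hpa.yaml is not reproduced here, but assuming the autoscaling/v2 API available on Kubernetes 1.23, a manifest expressing this condition (50% average CPU, 1 to 7 replicas, matching the output below) would look roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative sketch of the HPA manifest
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: apache-php
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: apache-php
  minReplicas: 1
  maxReplicas: 7
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;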

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create HPA
kubectl apply -f apache-php-hpa.yaml 

# Get the HPA configured
kubectl get hpa
NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
apache-php   Deployment/apache-php   0%/50%    1         7         1          21s

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The get hpa command shows that the current CPU utilization is 0% and the target scaling condition is configured at 50%. It also shows the min, max and current replicas of the pod running.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before executing the next step, open another shell and fire the below command to watch the HPA with the watch flag -&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get hpa -w
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 5 —&lt;/strong&gt; Bombard the workload service with requests and wait for the workload to auto scale out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Launch a busybox pod and fire requests to the workload service in an infinite loop
kubectl run -i --tty load-gen --image=busybox -- /bin/sh

# And from the busybox shell, fire the below command
while true; do wget -q -O - http://apache-php; done

# And you should see output like this
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you look at the output of the shell where you executed the kubectl get hpa -w command, you will see pods getting scaled out as the load increases.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
apache-php   Deployment/apache-php   0%/50%    1         7         1          4m41s
apache-php   Deployment/apache-php   0%/50%    1         7         1          5m16s
apache-php   Deployment/apache-php   3%/50%    1         7         1          5m31s
apache-php   Deployment/apache-php   0%/50%    1         7         1          6m1s
apache-php   Deployment/apache-php   352%/50%   1         7         1          6m46s
apache-php   Deployment/apache-php   455%/50%   1         7         4          7m1s
apache-php   Deployment/apache-php   210%/50%   1         7         7          7m16s
apache-php   Deployment/apache-php   31%/50%    1         7         7          7m31s
apache-php   Deployment/apache-php   50%/50%    1         7         7          7m46s
apache-php   Deployment/apache-php   59%/50%    1         7         7          8m1s
apache-php   Deployment/apache-php   46%/50%    1         7         7          8m16s
apache-php   Deployment/apache-php   54%/50%    1         7         7          8m31s
apache-php   Deployment/apache-php   57%/50%    1         7         7          8m46s
apache-php   Deployment/apache-php   59%/50%    1         7         7          9m1s
apache-php   Deployment/apache-php   52%/50%    1         7         7          9m16s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Are you wondering why, towards the end, no more pods are getting launched even though the CPU utilization is above 50%? The reason is that the maximum number of pods configured in the HPA is 7, and the HPA controller will not go beyond the configured maximum.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6 —&lt;/strong&gt; Stop the load and wait for the workload to auto scale in.&lt;/p&gt;

&lt;p&gt;In the shell where you launched busybox and ran the while loop firing HTTP requests, stop that process. Then watch the output of the kubectl get hpa -w command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apache-php   Deployment/apache-php   27%/50%    1         7         7          9m31s
apache-php   Deployment/apache-php   3%/50%     1         7         7          9m46s
apache-php   Deployment/apache-php   0%/50%     1         7         7          10m
apache-php   Deployment/apache-php   0%/50%     1         7         7          14m
apache-php   Deployment/apache-php   0%/50%     1         7         7          14m
apache-php   Deployment/apache-php   0%/50%     1         7         4          14m
apache-php   Deployment/apache-php   0%/50%     1         7         1          14m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The pods automatically scale in as the load decreases.&lt;/p&gt;
&lt;/blockquote&gt;
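
&lt;p&gt;Notice in the output above that the replica count stays at 7 for a few minutes after the load stops. That is the HPA’s default scale-down stabilization window of 300 seconds, which prevents replica flapping. If you want faster (or slower) scale-in, the autoscaling/v2 API lets you tune this via the behavior section of the HPA spec; the values below are illustrative, not a recommendation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative snippet to add under the HPA spec
behavior:
  scaleDown:
    stabilizationWindowSeconds: 60   # default is 300
    policies:
      - type: Percent
        value: 50                    # remove at most 50% of replicas per period
        periodSeconds: 60
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;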

&lt;p&gt;&lt;strong&gt;Step 7 —&lt;/strong&gt; One of the most important steps: delete the Kubernetes cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Delete EKS Cluster 
eksctl delete cluster -f eks-cluster.yaml

# If you see an output like this, assume all has gone well :) 
2022-12-29 10:46:35 [ℹ]  will delete stack "eksctl-hpa-demo-cluster"
2022-12-29 10:46:35 [✔]  all cluster resources were deleted
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;In the above example&lt;/strong&gt;, we scaled our workload based on CPU as a metric. But if you need to scale your workload on a custom metric, then you need to ensure that the specific metric is available for the HPA controller to consume via the metrics APIs. One of the options to achieve this is to use the Prometheus Adapter service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On a related concern&lt;/strong&gt;, what would happen if the maximum number of pods configured has not yet been hit, but the Kubernetes cluster has run short of CPU resources? Logically, we would want additional nodes to be added automatically to the Kubernetes cluster; otherwise Pods will go into Pending state, impacting the Service performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Additionally&lt;/strong&gt;, as a Platform engineer, you would also want to get an alert if a Pod goes into Pending state for whatever reason.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hence, the next few blogs will cover some of the aforementioned situations —&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Automatically scale Kubernetes Cluster if pods are going into Pending state because of resource shortage - &lt;a href="https://dev.to/aws-builders/autoscaling-nodes-in-kubernetes-4ge5"&gt;Blog&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Scale the Pods based on Custom Metrics&lt;/li&gt;
&lt;li&gt;Get an alert on Slack if Pods go into Pending state&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Hope you enjoyed reading this blog. Do share this blog with your friends if this has helped you in any way.&lt;/p&gt;

&lt;p&gt;Happy Blogging…&lt;strong&gt;Cheers!!!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>eks</category>
      <category>horizontalpodautoscaling</category>
      <category>kubernetes</category>
      <category>productionreadykubernetes</category>
    </item>
  </channel>
</rss>
