
Jay Modi


Kubernetes and Reliability

Over the years of building software and deploying it in production, I never gave much thought to reliability, until I started using Kubernetes. There is no single right path to Kubernetes success; instead, there are a few best practices that can lead to high reliability.

These days it is getting harder and harder for organisations to scale without adopting Kubernetes. I want to highlight how to achieve high reliability with Kubernetes in your organisation, and which traditional practices we should stop following. If I can do just that, my goal of helping you make an informed decision will be fulfilled.

Below are the four pillars of reliability I want to discuss:

  • Resource Requests, Limits and AutoScaling

  • Liveness & Readiness probes

  • Adopting CI/CD pipelines

  • Use managed k8s cloud services

Resource Requests, Limits and AutoScaling

The Kubernetes scheduler does its job of placing pods onto suitable nodes by looking at each pod's resource requests and limits. If a single pod takes up all of a node's resources, the other pods on that node risk being restarted.

If the node where a Pod is running has enough of a resource available, it's possible (and allowed) for a container to use more resource than its request for that resource specifies. However, a container is not allowed to use more than its resource limit.

Since a pod can have multiple containers, resource requests and limits are defined at the container level within each pod. Setting limits increases node reliability because no single pod can consume all of the resources; if a container exceeds its allocated memory, the system kernel identifies the process and kills it with an out-of-memory (OOM) error.
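
Here is a minimal sketch of container-level requests and limits (the image name and values are illustrative, not a recommendation):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: app
      image: nginx:1.25        # illustrative image
      resources:
        requests:              # the scheduler uses these to place the pod
          cpu: "250m"
          memory: "128Mi"
        limits:                # the container may not exceed these
          cpu: "500m"          # CPU usage above the limit is throttled
          memory: "256Mi"      # memory usage above the limit triggers an OOM kill
```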

Autoscaling, on the other hand, is applied to both pods and the cluster based on resource usage. With the Horizontal Pod Autoscaler (HPA), the number of replicas is increased or decreased to meet a pre-defined target, such as average CPU utilisation.
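
For example, a minimal HPA sketch (the Deployment name and thresholds are assumptions for illustration):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                      # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas when average CPU passes 70%
```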

Liveness & Readiness probes

"Self-healing" is the word that comes in mind when I think about liveness and readiness probes. If a process is stuck doing some computation and that renders the pod unresponsive, that in turn will fail to complete the requirements of either liveness or readiness probe, and in-turn force Pod to be killed by Kubernetes.

A liveness probe tells Kubernetes whether the pod is still alive; if it fails, the container is restarted. A readiness probe, on the other hand, indicates whether the pod is ready to serve traffic; if it fails, the pod is taken out of the Service's endpoints.

Probes make the Kubernetes cluster perform a check on your containers at set intervals. Each probe has two states, pass and fail, along with a threshold for how many times the probe has to fail or succeed before the state changes. Once they are configured correctly on all of your containers, these two probe types give the cluster the ability to "self-heal": problems that arise in containers are detected automatically, and pods are killed or taken out of service without manual intervention.
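
As a concrete sketch, here is a container with both probe types configured (the endpoints, ports and thresholds are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: app
      image: nginx:1.25            # illustrative image
      livenessProbe:
        httpGet:
          path: /healthz           # hypothetical health endpoint
          port: 8080
        initialDelaySeconds: 10    # give the app time to start
        periodSeconds: 15          # probe interval
        failureThreshold: 3        # restart after 3 consecutive failures
      readinessProbe:
        httpGet:
          path: /ready             # hypothetical readiness endpoint
          port: 8080
        periodSeconds: 5
        failureThreshold: 2        # removed from Service endpoints after 2 failures
```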

Here, reliability means the stability of your services. The right probe configuration depends on your own set of requirements.

Adopting CI/CD pipelines

Pods are ephemeral. Kubernetes works best when used declaratively. Every Deployment should be exposed through a Service, as in the sketch below.
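
For instance, a minimal Service fronting a Deployment's pods (names and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web            # matches the labels on the Deployment's pods
  ports:
    - port: 80          # port the Service exposes
      targetPort: 8080  # port the container listens on
```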

Do you see where I'm going with this? It's not good practice to change anything at runtime in a live production cluster. Not at all!

What's the alternative? Continuous integration and continuous deployment. CI builds a new image whenever code changes are pushed to certain branches; CD then deploys that new image to your k8s cluster. The point is not patching a particular pod at runtime, it's applying your new k8s objects. That way your k8s objects are version controlled too, and pass various quality gates before making it to production.
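
As a minimal sketch, such a pipeline might look like this (GitHub Actions syntax; the registry, image name and manifest path are assumptions):

```yaml
name: build-and-deploy
on:
  push:
    branches: [main]              # CI triggers on pushes to main
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image
        run: |
          docker build -t registry.example.com/web:${{ github.sha }} .
          docker push registry.example.com/web:${{ github.sha }}
      - name: Apply version-controlled manifests
        # assumes a kustomization.yaml in k8s/ and cluster credentials configured
        run: |
          cd k8s
          kustomize edit set image web=registry.example.com/web:${{ github.sha }}
          kubectl apply -k .
```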

Use managed k8s cloud services

Running Kubernetes on bare metal involves several tasks: cluster setup, cluster hardening, administration, backups, scaling, provisioning resources, configuring applications and, last but not least, app deployment. You get full flexibility over your OS and container runtime, no provider lock-in, and the lowest possible price; but before you jump on this wagon, the effort hidden in that bag of tasks shouldn't be underestimated.

Managing the worker nodes alone takes a lot of effort; don't try to manage the control plane yourself on top of that. Not to mention, version upgrades alone are a serious piece of work.

Managed Kubernetes solutions such as GKE (Google Kubernetes Engine) and Amazon's Elastic Kubernetes Service (EKS) provide a very nice UI for designing your cluster with autoscaling, networking, security and disks. Still, it is wise to understand that the degree to which these Kubernetes offerings are "managed", in the sense that you get operational support as well as hosted infrastructure, varies considerably.

The bottom line is that your business can run reliably 24/7 on Kubernetes, since it can self-heal and scale horizontally, and monitoring your applications in Kubernetes is easy to explore. I might just explore that in my next blog post!

Thank you for reading!
