As users and other entities access your applications, resources get utilized: more CPU, memory, and storage get consumed, or a containerized app hits maximum capacity because so many people are trying to access it.
When a server runs out of resources or a container is at max capacity, you need a way to add more resources or more containers to your environment. That way, apps and infrastructure don’t crash under heavy utilization.
That’s where scaling comes into play.
In this blog post, you’ll learn about what scaling is and how to horizontally scale your Kubernetes environment from a container and infrastructure perspective.
Let’s think about two examples - an eCommerce store and a newly minted app.
For the eCommerce store, think about what a popular one would look like on Cyber Monday. You can think of Amazon or Nike or any type of website online. As you can imagine, there are several backends, APIs, middleware, and frontends. On a day-to-day basis, the engineers more or less know what the load looks like. They set up the proper infrastructure so that if more users access the site, the app doesn’t go down. But what about Cyber Monday? The load could double or triple. Engineers need an automated, repeatable, and efficient way to get more infrastructure or new application servers, or new containers.
How about a smaller, yet nimble example? Say you’re building a new application. You’ve been prepping the MVP for months and you’re ready to deploy it to the world. You just so happen to be hosting this app in a containerized fashion on Kubernetes. Today’s the day. You’re ready to let it out into the world. The problem is, you have no idea how many people are going to use it. It could be one person or twenty thousand people. In any case, at this stage, it wouldn’t be efficient to create a bunch of worker nodes or replicas to handle the load that may occur. Instead, you would need an automated way to take care of this.
These two scenarios are where scaling comes into play.
Scaling gives you the ability to increase/decrease the number of Pods, the number of infrastructure components like worker nodes, or the amount of resources (like CPU/memory) on Pods and clusters. If there are more users or more load on your Kubernetes environment, you can increase the number of Pods or the number of worker nodes to accommodate the increase in load. Once the load goes down, you can decrease the number of Pods or the number of worker nodes. Scaling isn’t just for “going up”, it’s also for “going down”. A good scalability architecture always accounts for both.
Although you won’t see it quite as often, there is another type of scaling: vertical scaling. Vertical scaling is a way to scale resources like memory/CPU in a Pod, or in your cluster itself, like the worker nodes.
For example, if a Pod had a 50Mi memory request and you wanted to bump it up to 100Mi due to load on the Pod, you could do so with the Vertical Pod Autoscaler.
You can find out more about the Vertical Pod Autoscaler here: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
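As a rough sketch of what that looks like (the target name `nginx-deployment` and the memory bounds here are assumptions for illustration, not from any particular setup), a VerticalPodAutoscaler manifest might look like this:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: nginx-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  updatePolicy:
    updateMode: "Auto"   # VPA evicts and recreates Pods with updated requests
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        memory: 50Mi
      maxAllowed:
        memory: 100Mi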
If you want to scale worker nodes, you can do that too. You can change the size of a worker node in the cloud or add more memory to a bare metal server.
As mentioned, vertical scaling doesn’t occur too much in production. However, it is an option and you should know about it.
Conceptually, Pod scaling comes from the Horizontal Pod Autoscaler, which is a Kubernetes API in the `autoscaling` API group. The resource/object is `HorizontalPodAutoscaler`, and you can use it to:
- Specify which Kubernetes resource you want to scale.
- Specify the minimum number of replicas.
- Specify the maximum number of replicas.
- Specify when replicas should be scaled based on utilization. For example, if Pods are at 75% of their requested CPU, they should be scaled up.
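One detail worth calling out: the utilization target is measured against each Pod’s resource requests, so the workload you’re scaling needs CPU requests set, or the autoscaler has nothing to compute a percentage against. A minimal sketch of such a Deployment (the image and request values here are assumptions):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        resources:
          requests:
            cpu: 250m   # the HPA's utilization percentage is relative to this request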
Please note that autoscaling doesn’t work with all Kubernetes resources. For example, because DaemonSets run exactly one Pod on every worker node, they don’t work with Horizontal Pod Autoscaling.
Below is the first example of scaling. You can use the `autoscale` command from `kubectl` to specify your Kubernetes Deployment along with the target utilization and the min/max replica counts. The problem with this approach is that it’s imperative, and with Kubernetes, we want to stay as declarative as possible.
```shell
kubectl autoscale deployment nginx-deployment --cpu-percent=75 --min=2 --max=10
```
Below is the declarative and preferred method. It defines the same things as the command above: the Deployment to scale, the min/max replica counts, and the target utilization.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginxscale
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75
```
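Assuming the manifest above is saved as `hpa.yaml` (a hypothetical filename), you can apply it and then watch the autoscaler’s current versus desired replica counts:

```shell
# Create the HorizontalPodAutoscaler from the manifest
kubectl apply -f hpa.yaml

# Watch current utilization and replica counts as load changes
kubectl get hpa nginxscale --watch
```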
Scaling clusters is where serious infrastructure needs come into play. When you’re working on a Kubernetes cluster and deploying apps, those containerized apps are running on worker nodes. Those worker nodes, underneath the hood, are servers. They could be running Windows or Linux. They could be virtual, in the cloud, or even bare metal. Regardless of how and where they’re running, they have resources: CPU, memory, storage, and so on. As more containerized apps run on a Kubernetes cluster, more and more of those resources get consumed. Because of that, you need a way to automatically scale the cluster itself.
Let’s take the eCommerce example from the beginning of this blog post. With Horizontal Pod Autoscaling, more Pods consume more resources on the worker nodes/servers. What happens when the servers have no more resources to give? That’s when a new worker node must be deployed, and automating that is the job of the Cluster Autoscaler. That way, the Pods have a place to go, so they can run the containerized application successfully with the CPU and memory the app needs.
You can find more about the Cluster Autoscaler here: https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
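The Cluster Autoscaler typically runs inside the cluster itself and is pointed at your node groups. As a rough sketch for AWS (the autoscaling group name and the min/max bounds here are assumptions), the relevant container args in its Deployment look something like this:

```yaml
# Fragment of the cluster-autoscaler Deployment spec, not a complete manifest
command:
- ./cluster-autoscaler
- --cloud-provider=aws
- --nodes=2:10:my-node-group   # min:max:autoscaling-group-name (assumed name)
- --skip-nodes-with-local-storage=false
```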
Another popular platform for scaling clusters is Karpenter. From a functionality perspective, it does the same thing as Cluster Autoscaler, but at the time of writing it only supports AWS. If you’re in another cloud, on-prem, or have a combination of both, Karpenter isn’t the tool for you, as there’s vendor lock-in.
After reading all of this, you may be wondering to yourself: so what’s the con of autoscaling? Funny enough, there is one.
When you’re scaling clusters on the fly or based on load, it can take those servers some time to spin up. It’s not instant. Because of that, you may face a problem where an application is down for “X” amount of time until the newly created worker nodes are ready and join the Kubernetes cluster, at which point Pods can be scheduled onto the new worker nodes.
This brings us to a concept called Cluster Overprovisioning, which is essentially a method of keeping an “active/passive” (or, as some call it, “hot/cold”) cluster setup: worker nodes on standby, already created and running, but sitting idle and waiting for load. Although this option will cost an organization more money, because you’ll be paying for servers that just sit there, the question becomes: what’s more important? Paying a few extra dollars for servers, or paying thousands or even millions if an app can’t properly scale?
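A common way to implement overprovisioning is to run low-priority placeholder Pods that reserve capacity; when real workloads need room, the scheduler evicts the placeholders, and the displaced placeholders trigger a scale-up of new nodes. A minimal sketch of that pattern (the names and resource sizes here are assumptions):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                 # lower than the default (0), so these Pods are evicted first
globalDefault: false
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 2
  selector:
    matchLabels:
      app: overprovisioning
  template:
    metadata:
      labels:
        app: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: reserve
        image: registry.k8s.io/pause:3.9   # does nothing; just holds the reservation
        resources:
          requests:
            cpu: "1"
            memory: 1Gi
```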
There’s a project for this that may help you out. It’s not guaranteed to stay maintained, but you can fork it into your own Git repo and manage it yourself for your organization: https://github.com/deliveryhero/helm-charts/tree/master/stable/cluster-overprovisioner
Thanks so much for reading! As always, I appreciate every single bit of support I get from the community.
If you’d like to see what else I’m working on, check out the links below: