For cost reasons, it is often neither feasible nor desirable to assign enough resources to a deployment for it to handle peak loads at all times. Instead, we typically scale applications up and down based on the load they are currently facing. This usually involves a minimum number of instances that are kept deployed at all times, even if there is no load. That minimum can force us to keep more worker nodes in our Kubernetes cluster than necessary, because each instance reserves its assigned resource budget.
In this blog post, we will look at how to reduce the minimum number of deployed instances to zero and discuss which kinds of applications benefit from that the most.
Use Cases
When you scale applications to zero, you have to bear a few things in mind. First, the built-in HorizontalPodAutoscaler can only scale down to a minimum of one replica, so another scaler is required, which adds a bit of overhead to the cluster. In this post, we will focus on using the Kubernetes Event-Driven Autoscaler (KEDA) for that purpose.
Second, we often assume that there is at least one replica when we design our systems. This means that certain metrics are typically not collected while no instance is running. We therefore need a metric that remains available to the autoscaler while the deployment is scaled to zero; otherwise, we will not be able to scale back up. Also, how are incoming connections going to be handled when there is no instance running?
Some cases are easy to solve. Applications that follow a producer-consumer pattern make it easy to scale the consumer to zero. For example, when an application consumes messages from a message queue, we can use the current queue length as the scaling metric. If the queue is empty, there is no need for a worker consuming it, and as soon as there is at least one entry in the queue, we can deploy a consumer.
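As an illustration, a minimal KEDA ScaledObject for such a consumer might look like the sketch below. The deployment name queue-worker, the queue name jobs, the replica limits, and the RABBITMQ_HOST environment variable are placeholder assumptions, not values from a real setup; KEDA ships similar scalers for many other queueing systems.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: queue-worker-scaler
  namespace: demoapplication
spec:
  scaleTargetRef:
    name: queue-worker           # the consumer deployment (placeholder name)
  minReplicaCount: 0             # allow scaling all the way down to zero
  maxReplicaCount: 20
  triggers:
  - type: rabbitmq
    metadata:
      queueName: jobs            # scale based on the length of this queue
      mode: QueueLength
      value: "5"                 # target of five pending messages per replica
      hostFromEnv: RABBITMQ_HOST # connection string resolved from the worker's environment

With minReplicaCount set to zero, KEDA removes the last consumer once the queue stays empty and deploys one again as soon as messages appear.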
Other cases become a little harder to solve. With a web application, we don’t have a good metric available when the application is scaled to zero. Usually, we would want to base the scaling on the number of requests in a given time frame or the average response time.
We could use a metric from an external source if available or implement another component to monitor incoming requests.
The need for such a component becomes even more apparent when we consider the requests themselves. We don't want to miss requests while the application is scaled to zero. Therefore, we have to keep track of incoming connections and scale the application back up as soon as a request arrives while it is at zero replicas.
The KEDA HTTP add-on is a component of KEDA that allows us to base scaling on the number of incoming HTTP requests. It contains a proxy that is inserted between the ingress controller and the application in Kubernetes. This proxy buffers incoming requests and reports the number of pending requests as a metric to KEDA.
An example configuration for KEDA with the HTTP add-on may look like this:
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: frontend-scaler
  namespace: demoapplication
  labels:
    deploymentName: frontend
spec:
  host: "demo.example.com"
  scaleTargetRef:
    deployment: frontend
    service: frontend
    port: 8080
  targetPendingRequests: 10
  replicas:
    min: 0
    max: 100
This configuration scales the frontend deployment to zero when there are no pending requests and up to a maximum of 100 instances as needed to keep the number of pending requests around the target of 10. While this works well for applications with a gradually increasing load, you might run into difficulties with applications whose request rate rises sharply; in that case, the proxy itself can become a bottleneck.
Another class of applications that is easy to scale to zero consists of those that are only used during certain time periods. For example, a learning platform used by a school sees far more traffic during the daytime on school days than at night. Furthermore, such applications often have a non-negligible start-up time, which could lead to request time-outs when scaling up from zero in response to a request.
This time dependency can be used to preemptively scale the application up shortly before heavy traffic is expected, preventing users from hitting an unreachable application during normal hours. At night, the application could be scaled to zero and brought back up via the proxy if demand arises. This can still lead to a time-out for the first request, but the savings may be worth it, especially if the platform is hosted by a service provider for multiple schools.
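One way to express the schedule-based part with KEDA is a cron trigger that keeps a baseline of replicas during school hours and lets the deployment drop to zero outside of them. The deployment name learning-platform, the timezone, the schedule, and the replica counts below are illustrative assumptions; combining this with the request-based scaling of the HTTP add-on needs additional configuration and is not shown here.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: learning-platform-scaler
  namespace: demoapplication
spec:
  scaleTargetRef:
    name: learning-platform   # placeholder deployment name
  minReplicaCount: 0          # outside the schedule, scale down to zero
  maxReplicaCount: 10
  triggers:
  - type: cron
    metadata:
      timezone: Europe/Berlin
      start: 0 7 * * 1-5      # scale up at 07:00 on weekdays, before classes start
      end: 0 18 * * 1-5       # release the baseline at 18:00
      desiredReplicas: "2"    # baseline kept ready during school hours

Between start and end, the cron trigger keeps at least the desired number of replicas running; outside that window, the minimum of zero applies again.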
Conclusion
Of course, scaling to zero is not always a viable option as some applications don’t benefit from it. This includes applications with a constant base load or only short time frames without traffic. In these cases, the overhead of scaling from one to zero and back again might easily outweigh the benefits of scaling to zero.
In short, most applications running on Kubernetes can be scaled to zero with a bit of effort, which can reduce both your infrastructure bill and your carbon footprint.