<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jordi Been</title>
    <description>The latest articles on DEV Community by Jordi Been (@jordibeen).</description>
    <link>https://dev.to/jordibeen</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1177599%2F872d1e32-1c05-4cf8-9374-eab1ebdc34a2.jpg</url>
      <title>DEV Community: Jordi Been</title>
      <link>https://dev.to/jordibeen</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jordibeen"/>
    <language>en</language>
    <item>
      <title>From Chaos to Control: The Importance of Tailored Autoscaling in Kubernetes</title>
      <dc:creator>Jordi Been</dc:creator>
      <pubDate>Wed, 14 Aug 2024 08:58:35 +0000</pubDate>
      <link>https://dev.to/check/from-chaos-to-control-the-importance-of-tailored-autoscaling-in-kubernetes-2kpn</link>
      <guid>https://dev.to/check/from-chaos-to-control-the-importance-of-tailored-autoscaling-in-kubernetes-2kpn</guid>
      <description>&lt;p&gt;Autoscaling in Kubernetes (k8s) is hard to get right. There are a lot of different autoscaling tools and flavors to choose from, while at the same time, each application demands a different set of resources. So, unfortunately, there's no 'one size fits all' implementation for autoscaling. A custom configuration that's tailor-made to the type of application you're hosting is often the best bet.&lt;/p&gt;

&lt;p&gt;At Check, it took us a few iterations until we found the ideal configuration for our main API. The optimal solution required us to not only configure Kubernetes correctly but also tweak some settings in the k8s Deployment for it to work perfectly.&lt;/p&gt;

&lt;p&gt;In this blog post, we'd like to share some of the challenges we faced and mistakes we made, so that you don't have to make them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Autoscaling in Kubernetes: Choosing the Right Tool for Your Deployment
&lt;/h2&gt;

&lt;p&gt;The right cluster-based autoscaling configuration depends heavily on the type of Deployment you're hosting and on picking the right tools for the job. There are several types of autoscaling tools to choose from when using Kubernetes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scaling Deployments
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Horizontal Pod Autoscaling
&lt;/h4&gt;

&lt;p&gt;A Horizontal Pod Autoscaler (HPA) dynamically adjusts the number of Pods in a Deployment to match changing workload demands. When traffic increases, the HPA scales up by deploying more Pods. Conversely, when demand decreases, it scales back down.&lt;/p&gt;
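&lt;p&gt;As a minimal sketch (the names and targets here are hypothetical, not our production values), an HPA that keeps average CPU utilization around 70% could look like this:&lt;/p&gt;

```yaml
# Hypothetical HPA: scales the Deployment between 2 and 20 replicas
# so that average CPU utilization stays around 70%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: main-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: main-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```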

&lt;h4&gt;
  
  
  Vertical Pod Autoscaling
&lt;/h4&gt;

&lt;p&gt;A Vertical Pod Autoscaler (VPA) automatically sets resource requests and limits based on usage patterns. This improves scheduling efficiency, since Kubernetes only places Pods on Nodes that have sufficient capacity. A VPA can also scale down Pods that are over-requesting resources and scale up those that are under-requesting them.&lt;/p&gt;
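&lt;p&gt;For illustration (assuming the VPA operator is installed in the cluster; the names are hypothetical), a VPA that applies its recommendations automatically could look like this:&lt;/p&gt;

```yaml
# Hypothetical VPA: lets the autoscaler set resource requests
# for the Deployment based on observed usage.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: main-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: main-api
  updatePolicy:
    updateMode: "Auto"
```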

&lt;h4&gt;
  
  
  KEDA
&lt;/h4&gt;

&lt;p&gt;For more complex use cases, you can leverage the &lt;a href="https://keda.sh/" rel="noopener noreferrer"&gt;Kubernetes Event Driven Autoscaler (KEDA)&lt;/a&gt; to scale Deployments based on external events. This allows you to scale according to a Cron schedule, database queries (PostgreSQL, MySQL, MSSQL), or items in an event queue (Redis, RabbitMQ, Kafka).&lt;/p&gt;
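&lt;p&gt;As a hedged example of the Cron flavor (the names and times are made up for illustration), a KEDA ScaledObject could scale a Deployment up for a recurring window like this:&lt;/p&gt;

```yaml
# Hypothetical KEDA ScaledObject: scales a Deployment to 5 replicas
# during a fixed window on the 1st of each month, then back down.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: invoice-generator-cron
spec:
  scaleTargetRef:
    name: invoice-generator
  triggers:
    - type: cron
      metadata:
        timezone: Europe/Amsterdam
        start: 0 0 1 * *    # scale up at midnight on the 1st of the month
        end: 0 6 1 * *      # scale back down at 06:00
        desiredReplicas: "5"
```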

&lt;h3&gt;
  
  
  Scaling Nodes
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Cluster Autoscaling
&lt;/h4&gt;

&lt;p&gt;A Cluster Autoscaler automatically manages Node scaling by adding Nodes when there are unschedulable Pods and removing them when possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scaling Our Main API
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Unpredictable Nature of Our Traffic
&lt;/h3&gt;

&lt;p&gt;As a shared mobility operator in The Netherlands, our main API's traffic is directly tied to the actual traffic in cities. It's not uncommon for us to see a significant spike in requests during rush hour - we're talking 100K requests per 5 minutes! On the other hand, weekdays at midnight are a different story, with only around 5-10K requests per 5 minutes. And then there are the weekends, which can be highly unpredictable due to weather conditions.&lt;/p&gt;

&lt;p&gt;Differences in load this large are impossible to account for manually - especially when you factor in surprise spikes and peak loads. That's where k8s autoscaling comes in, saving our lives (and sanity!) by automatically scaling our resources to match demand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9no93ogg1xs9064wy6zn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9no93ogg1xs9064wy6zn.png" alt="Graph showing API traffic fluctuation in response to Dutch city traffic demand" width="800" height="181"&gt;&lt;/a&gt;&lt;em&gt;Graph showing API traffic fluctuation in response to Dutch city traffic demand&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Our Use Case: HPA + Cluster Autoscaler
&lt;/h3&gt;

&lt;p&gt;For our use case, we found that a Horizontal Pod Autoscaler (HPA) combined with a Cluster Autoscaler was the perfect solution. During rush hour, the HPA scales up our Deployment to meet demand, spinning up more Pods as needed. When there aren't enough resources available on running Nodes, the Cluster Autoscaler kicks in, automatically adding new Nodes to the mix.&lt;/p&gt;

&lt;p&gt;When traffic dies down, the HPA scales back down to a manageable level, after which the Cluster Autoscaler removes unnecessary Nodes. This automated scaling has been a game-changer for us, allowing us to focus on other important tasks while our infrastructure takes care of itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge of Unpredictable Deployments
&lt;/h2&gt;

&lt;p&gt;As we delved into the world of Kubernetes autoscaling, we encountered a difficult challenge to overcome. Kubernetes' autoscaling tools depend on the retrieval of metrics. For resource metrics, this is the &lt;code&gt;metrics.k8s.io&lt;/code&gt; API, provided by the &lt;a href="https://github.com/kubernetes-sigs/metrics-server" rel="noopener noreferrer"&gt;Kubernetes Metrics Server&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We tried to understand our Deployment's behavior by analyzing its resource usage in Grafana. However, we soon realized that the amount of memory used by each Pod was fluctuating wildly. Because our Deployment's resource usage was so unpredictable, it was very difficult to configure our resources correctly for autoscaling.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Eye Opener
&lt;/h4&gt;

&lt;p&gt;While developing one of our microservices built in FastAPI, we stumbled upon a crucial piece of documentation that highlighted the importance of handling replication at the cluster level rather than using process managers like Gunicorn in each container.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If you have a cluster of machines with Kubernetes [...] then you will probably want to handle replication at the cluster level instead of using a process manager (like Gunicorn with workers) in each container.”&lt;br&gt;
  &lt;a href="https://fastapi.tiangolo.com/deployment/docker/#replication-number-of-processes" rel="noopener noreferrer"&gt;"Replication - Number of Processes"&lt;/a&gt; (FastAPI documentation)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This was a real eye-opener for us!&lt;/p&gt;

&lt;h4&gt;
  
  
  Gunicorn Workers Causing Confusion
&lt;/h4&gt;

&lt;p&gt;Check's main API was originally built in &lt;a href="https://docs.pylonsproject.org/projects/pyramid/en/latest/" rel="noopener noreferrer"&gt;Pyramid&lt;/a&gt;, a Python web framework. Just like Django, Pyramid projects are typically served as a WSGI callable using a WSGI HTTP Server such as Gunicorn. Our legacy configuration had Gunicorn set to use 4 workers at all times.&lt;/p&gt;

&lt;p&gt;On &lt;a href="https://docs.gunicorn.org/en/stable/design.html#how-many-workers" rel="noopener noreferrer"&gt;Gunicorn's documentation page&lt;/a&gt;, they strongly advise running multiple workers, recommending &lt;em&gt;"(2 x $num_cores) + 1 as the number of workers to start off with"&lt;/em&gt; and seemingly incentivizing users to use as many workers as possible.&lt;/p&gt;

&lt;p&gt;As we dug deeper into the issue, we realized that Gunicorn's load balancing across multiple worker processes was now confusing the Kubernetes Metrics Server API. Because a single Pod had 4 different workers actively processing requests, the resources it used would vary greatly according to the types of operations it was handling at the same time.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solution: A Single Process Per Pod
&lt;/h3&gt;

&lt;p&gt;After this revelation, we moved to a single Gunicorn worker per Pod and saw immediate positive results.&lt;/p&gt;
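&lt;p&gt;In practice this boils down to a one-line change in the container image. A hypothetical Dockerfile excerpt (the module path is made up for illustration) would be:&lt;/p&gt;

```dockerfile
# One Gunicorn worker per container; replication is handled
# by the HPA at the cluster level instead.
CMD ["gunicorn", "app.wsgi:application", "--workers", "1", "--bind", "0.0.0.0:8000"]
```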

&lt;p&gt;Even though we now had to run close to 4 times as many Pods, we were able to simplify the Deployment's resource configuration, ultimately letting a single Pod run with significantly fewer resources too!&lt;/p&gt;

&lt;p&gt;Analyzing the behavior of individual Pods in Grafana after these changes revealed far fewer memory spikes, with each Pod staying close to its average resource usage. Most importantly, our HPA started doing its job correctly!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s7sm3ntu0b6351faspl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s7sm3ntu0b6351faspl.png" alt="Graph showing pods spinning up in response to increased demand" width="800" height="279"&gt;&lt;/a&gt;&lt;em&gt;Graph showing Pods spinning up in response to increased demand&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Kubernetes autoscaling can be a complex beast, but with the right approach, it can bring significant benefits to your Deployment. As we navigated the world of Kubernetes autoscaling, we learned some valuable lessons.&lt;/p&gt;

&lt;h3&gt;
  
  
  Analyze and Understand
&lt;/h3&gt;

&lt;p&gt;Thorough analysis is key when configuring cluster-based autoscaling with Kubernetes. By understanding your Deployment's resource usage patterns, you can set the right limits for individual Pods and ensure that your cluster autoscaler is working effectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  Avoid Metrics-Server Confusion
&lt;/h3&gt;

&lt;p&gt;When using WSGI tools like Gunicorn, be aware of their internal load-balancing features. These can confuse the metrics-server and lead to incorrect scaling decisions. To avoid this, configure your container image in such a way that it can be correctly scaled by the cluster instead.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tailoring Your Solution
&lt;/h3&gt;

&lt;p&gt;Most importantly, find the combination of tools and resource configuration that suits your unique deployment needs. We found that an HPA (Horizontal Pod Autoscaler) worked well for our main API Deployment, while a Cron-based autoscaler was a better fit for the Deployment that generates invoices on the first day of the month.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Payoff: Reduced Costs and Improved Peace of Mind
&lt;/h3&gt;

&lt;p&gt;By correctly configuring cluster-based autoscaling, we were able to reduce costs and improve peace of mind. Our Deployment now automatically scales according to traffic on our API, eliminating the need for manual server capacity reconfigurations.&lt;/p&gt;

&lt;p&gt;Even though getting to a workable setup isn't easy, it's well worth the time spent. And, as is often the case with technical concepts, your feel for configuring these relatively new tools improves as you use them more. With each new autoscaling setup, you'll gain more confidence in translating Grafana dashboards into HPA configurations, making it easier to configure autoscaling for your future deployments one step at a time.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>autoscaling</category>
      <category>devops</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>How having one million API requests an hour pointed us into building a Rust microservice that processes fleet updates</title>
      <dc:creator>Jordi Been</dc:creator>
      <pubDate>Tue, 30 Jan 2024 11:05:43 +0000</pubDate>
      <link>https://dev.to/check/how-having-one-million-api-requests-an-hour-pointed-us-into-building-a-rust-microservice-that-processes-fleet-updates-2d64</link>
      <guid>https://dev.to/check/how-having-one-million-api-requests-an-hour-pointed-us-into-building-a-rust-microservice-that-processes-fleet-updates-2d64</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;EVERYONE IN THE CITY, EVERYWHERE IN 15 MINUTES.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's our motto at &lt;a href="https://ridecheck.app/en" rel="noopener noreferrer"&gt;Check Technologies&lt;/a&gt;, a shared mobility operator in The Netherlands where users can rent e-mopeds, e-kickscooters or e-cars. When founding the company, Check decided to hire a team of engineers to build a custom platform, as opposed to using an off-the-shelf SaaS product. &lt;/p&gt;

&lt;p&gt;This team of just 6 engineers is responsible not only for building, maintaining and improving the Check application used by over 800K users today, but also for building internal tooling, performing data analyses and taking care of hosting the platform.&lt;/p&gt;

&lt;p&gt;From launching back in February 2020 until now, the company has seen significant growth in users, trips and vehicles.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Number of Vehicles&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 Jan 2021&lt;/td&gt;
&lt;td&gt;1170&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 Jan 2022&lt;/td&gt;
&lt;td&gt;3146&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 Jan 2023&lt;/td&gt;
&lt;td&gt;8160&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With this new blog, we (Check's engineering team) would like to share some of the technical challenges we had to overcome, the solutions we came up with, and the insights we've gained along the way. Expect write-ups from different engineers within the team who will share their thoughts on topics related to their domain, such as app development, cloud infrastructure and data engineering.&lt;/p&gt;

&lt;p&gt;First up: &lt;strong&gt;How having one million API requests an hour pointed us into building a Rust microservice that processes fleet updates&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A microservice, why?
&lt;/h2&gt;

&lt;p&gt;Up until the start of 2022, the Check backend was hosted as an &lt;a href="https://aws.amazon.com/elasticbeanstalk/" rel="noopener noreferrer"&gt;Elastic Beanstalk&lt;/a&gt; web application. Even though this AWS service proved to be reliable for getting us off the ground initially, we had run into its limits multiple times. Getting the right autoscaling configuration was rough, the costs were growing month after month and most importantly: it's not made for hosting a microservice infrastructure.&lt;/p&gt;

&lt;p&gt;By making the move to Kubernetes starting that year, we paved the way for building smaller applications that can run (and scale) independently. &lt;em&gt;Microservices&lt;/em&gt;, as you'd call it. &lt;/p&gt;




&lt;h2&gt;
  
  
  Webhooks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Over 1 million requests every hour&lt;/strong&gt;&lt;br&gt;
It all started years back, during a moment of celebration: we reached the impressive number of &lt;strong&gt;1 million requests an hour&lt;/strong&gt;. A moment worth cheering, yet also a moment in which we discovered something remarkable. We analysed the distribution of these requests and concluded that over 60% of them were &lt;em&gt;webhooks&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Webhooks&lt;/strong&gt;&lt;br&gt;
At Check, users can rent different types of vehicles in our app. At the time of this project, we had integrated the &lt;a href="https://nz.e-scooter.co/niu-n1s/" rel="noopener noreferrer"&gt;NIU N1S&lt;/a&gt; and &lt;a href="https://shop.segway.com/nl_nl/segway-escooter-e110s.html" rel="noopener noreferrer"&gt;Segway E110S&lt;/a&gt; mopeds, as well as the &lt;a href="https://www.segway.com/ninebot-kickscooter-max/" rel="noopener noreferrer"&gt;Segway Ninebot MAX&lt;/a&gt; kickscooter. Both providers have developed APIs for executing commands on their vehicles (turning them on and off) and for receiving information about them (location, mileage, battery percentage). Our backend exposed an API route that these providers used to POST vehicle information to, in the form of a webhook.&lt;/p&gt;

&lt;p&gt;For a moped's location, this API route processed updates as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receive updates (Eg: &lt;em&gt;moped [x] is now at [coordinates]&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Store raw location and time in the database&lt;/li&gt;
&lt;li&gt;Update the corresponding vehicle's location in the database&lt;/li&gt;
&lt;li&gt;Send a successful response&lt;/li&gt;
&lt;/ul&gt;
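&lt;p&gt;The steps above can be sketched in a few lines of Python (an in-memory sketch with hypothetical names; the real route wrote to our database):&lt;/p&gt;

```python
import time

# In-memory stand-ins for the raw-locations table and the vehicles table.
raw_locations = []
vehicles = {"moped-42": {"lat": 0.0, "lon": 0.0}}

def handle_location_webhook(vehicle_id, lat, lon):
    """Process one provider webhook: store the raw event, update the vehicle."""
    raw_locations.append({"vehicle": vehicle_id, "lat": lat, "lon": lon,
                          "received_at": time.time()})
    vehicles[vehicle_id] = {"lat": lat, "lon": lon}
    return 200  # successful response to the provider

status = handle_location_webhook("moped-42", 52.37, 4.90)
```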

&lt;p&gt;Even though this request was kept as small as possible, it still took our backend around &lt;strong&gt;250ms&lt;/strong&gt; to process. We had around 5000 vehicles back then, each sending us an update every 5 seconds when turned on, so it took our backend quite some time to process these updates, all while serving app users' requests as well.&lt;/p&gt;
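&lt;p&gt;A quick back-of-the-envelope calculation (assuming all vehicles are switched on at once, which is the worst case) shows why this hurt:&lt;/p&gt;

```python
# 5000 vehicles, one update per vehicle every 5 seconds,
# roughly 250 ms of processing per update.
vehicles = 5000
update_interval_s = 5
processing_time_s = 0.250

webhooks_per_second = vehicles / update_interval_s                 # 1000 updates/s
backend_work_per_second = webhooks_per_second * processing_time_s  # 250 s of work per wall-clock second
```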

&lt;p&gt;&lt;strong&gt;Third-party bombs&lt;/strong&gt;&lt;br&gt;
Our platform heavily depends on this integration for processing a provider's constant stream of vehicle updates. Even though this integration worked flawlessly most of the time, every once in a while one of the providers would have a small hiccup on their side. These hiccups not only meant not receiving their vehicle updates for a few minutes, but also that we were about to receive something we internally referred to as 'a bomb': a big batch of vehicle updates containing everything that happened during one of these hiccups. In short: &lt;strong&gt;we would sometimes get half an hour's worth of vehicle updates within a few seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Depending on how big they were, these 'bombs' were notorious for causing instability within our platform. Our backend was unable to process both the user traffic and all these vehicle updates at the same time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Off to better things&lt;/strong&gt;&lt;br&gt;
We longed for a situation where user traffic and fleet update traffic would no longer be processed by the same service. Given that over 60% of incoming traffic during peak hours consisted of webhooks, this was the perfect chance to put our new Kubernetes infrastructure to the test, and so we started building our first microservice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Rust
&lt;/h2&gt;

&lt;p&gt;Due to the sheer volume and relative simplicity of these webhook requests - a clear input (the webhook) and a clear output (a 200 OK status code) - we decided to build a proof of concept using Rust. Rust is a low-level language, primarily known for being strongly typed, memory safe and offering great performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stack&lt;/strong&gt;&lt;br&gt;
The proof-of-concept was built using the following Cargo crates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;rocket&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;serde&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tokio&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;postgres&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;redis&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project compiles into two separate binaries, one for the API service, and one for the consumer service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fleet webhook API service&lt;/strong&gt;&lt;br&gt;
The first component of our Rust microservice is the &lt;em&gt;'Fleet webhook API'&lt;/em&gt;. This service exposes a &lt;a href="https://rocket.rs/" rel="noopener noreferrer"&gt;rocket&lt;/a&gt; API layer with an endpoint for each provider to send their vehicle updates to. &lt;/p&gt;

&lt;p&gt;Once this service receives a webhook, it inserts the raw body into a &lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; queue and immediately responds with a '200 OK'. By not having to read from or write to the database during the request, we cut the response time by more than 10x. These little requests now take at most &lt;strong&gt;25ms&lt;/strong&gt;!&lt;/p&gt;
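&lt;p&gt;The fast path is simple enough to sketch in Python (the real service is Rust using rocket and redis; here a deque stands in for the Redis queue and the endpoint name is hypothetical):&lt;/p&gt;

```python
from collections import deque

fleet_queue = deque()  # stand-in for the Redis list used as a queue

def webhook_endpoint(raw_body):
    """Fast path: enqueue the raw payload and acknowledge immediately.
    No database reads or writes happen inside the request."""
    fleet_queue.append(raw_body)
    return 200

status = webhook_endpoint('{"vehicle": "moped-42", "lat": 52.37, "lon": 4.90}')
```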

&lt;p&gt;&lt;strong&gt;Fleet consumer service&lt;/strong&gt;&lt;br&gt;
The second component of our Rust microservice is the '&lt;em&gt;Fleet consumer&lt;/em&gt;'. This binary is connected to the same Redis queue, and is responsible for actually processing the updates.&lt;/p&gt;

&lt;p&gt;It updates the moped in the application's database and stores a raw entry of it to a &lt;a href="https://www.timescale.com/" rel="noopener noreferrer"&gt;TimeScale&lt;/a&gt; database (a PostgreSQL database specifically designed to handle large sets of event data).&lt;/p&gt;
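&lt;p&gt;The consumer side of that pattern, again as a hedged in-memory Python sketch (the real consumer is Rust, reading the shared Redis queue and writing to the databases):&lt;/p&gt;

```python
import json
from collections import deque

# Stand-ins for the shared Redis queue and the two write destinations.
fleet_queue = deque(['{"vehicle": "moped-42", "lat": 52.37, "lon": 4.90}'])
vehicles = {}
raw_events = []  # stand-in for the time-series table

def consume_one():
    """Pop one raw update, persist the raw event, update the vehicle record."""
    if not fleet_queue:
        return False
    update = json.loads(fleet_queue.popleft())
    raw_events.append(update)
    vehicles[update["vehicle"]] = {"lat": update["lat"], "lon": update["lon"]}
    return True

processed = consume_one()
```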

&lt;p&gt;&lt;strong&gt;Separate binaries&lt;/strong&gt;&lt;br&gt;
The great thing about this setup is that we can scale both of these components independently. Because the consumers that process the updates do most of the heavy lifting, we usually run around three times as many Kubernetes Pods of them as of the webhook API.&lt;/p&gt;




&lt;h2&gt;
  
  
  New situation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Dealing with bombs&lt;/strong&gt;&lt;br&gt;
This now means that user traffic, as well as back-office traffic, is handled independently from fleet update traffic. When a third party has a hiccup, resulting in loads of fleet updates to process at once, our users will not experience any latency in their apps because even though the microservice will be busy processing these updates, the main API is still sailing smoothly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extensibility&lt;/strong&gt;&lt;br&gt;
This microservice was built prior to when Check released e-cars on its platform. However, when integrating e-cars into the platform using &lt;a href="https://invers.com/en/solutions/cloudboxx/" rel="noopener noreferrer"&gt;Invers' Cloudboxx&lt;/a&gt;, we were able to swiftly implement their AMQP functionality to process live information about our cars, proving the extensibility of the service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Independent scaling&lt;/strong&gt;&lt;br&gt;
We're able to scale our main API independently from this fleet update microservice. With 60% of our traffic being fleet updates, we were able to significantly downscale our main API. Additionally, Rust's focus on performance and minimal resource usage allowed us to reduce costs in the meantime.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8kfrfeqkv4tlse5739s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8kfrfeqkv4tlse5739s.png" alt="Request distribution impact" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;At Check Technologies, our engineering efforts go beyond just adding new features for our users; we actively strive to enhance efficiency, scalability and resilience of the platform. By transitioning to Kubernetes and developing our first microservice in Rust, we were able to overcome the challenges associated with a high volume of webhook requests, ensuring a smooth experience for our users.&lt;/p&gt;

&lt;p&gt;The adoption of a microservices architecture, in combination with our first Rust-based solution, has revolutionised the way we process fleet updates. The 'Fleet webhook API' and 'Fleet consumer' services, operating as independent components, enable independent scaling, reduce latency and enhance overall system stability. We have effectively mitigated the impact of 'third-party bombs', allowing our main API to sail smoothly even during peak traffic hours.&lt;/p&gt;

&lt;p&gt;As we look back on the progress achieved in our technology stack, we are enthusiastic about the opportunities that lie ahead. At Check Technologies, we are dedicated to raising the bar, finding innovative solutions, and ensuring that our platform continues to be at the forefront of shared mobility technology. The journey has been challenging, but the success of our fleet microservice marks a high note, laying the foundation for sustained growth and ongoing technological advancements in the field of shared mobility.&lt;/p&gt;

</description>
      <category>sharedmobility</category>
      <category>rust</category>
      <category>kubernetes</category>
      <category>microservices</category>
    </item>
  </channel>
</rss>
