<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Check Technologies</title>
    <description>The latest articles on DEV Community by Check Technologies (@check).</description>
    <link>https://dev.to/check</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F7654%2F5c196e7f-c8d9-4ed3-a7fe-6586f2756c18.png</url>
      <title>DEV Community: Check Technologies</title>
      <link>https://dev.to/check</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/check"/>
    <language>en</language>
    <item>
      <title>From Chaos to Control: The Importance of Tailored Autoscaling in Kubernetes</title>
      <dc:creator>Jordi Been</dc:creator>
      <pubDate>Wed, 14 Aug 2024 08:58:35 +0000</pubDate>
      <link>https://dev.to/check/from-chaos-to-control-the-importance-of-tailored-autoscaling-in-kubernetes-2kpn</link>
      <guid>https://dev.to/check/from-chaos-to-control-the-importance-of-tailored-autoscaling-in-kubernetes-2kpn</guid>
      <description>&lt;p&gt;Autoscaling in Kubernetes (k8s) is hard to get right. There are a lot of different autoscaling tools and flavors to choose from, while at the same time, each application demands a different set of resources. So, unfortunately, there's no 'one size fits all' implementation for autoscaling. A custom configuration that's tailor-made to the type of application you're hosting is often the best bet.&lt;/p&gt;

&lt;p&gt;At Check, it took us a few iterations until we found the ideal configuration for our main API. The optimal solution required us not only to configure the autoscalers correctly but also to tweak some settings in the k8s Deployment itself for everything to work perfectly.&lt;/p&gt;

&lt;p&gt;In this blog post, we'd like to share some of the challenges we faced and mistakes we made, so that you don't have to make them.&lt;/p&gt;




&lt;h2&gt;Autoscaling in Kubernetes: Choosing the Right Tool for Your Deployment&lt;/h2&gt;

&lt;p&gt;The right cluster-based autoscaling configuration depends heavily on the type of Deployment you're hosting and on choosing the right tools for the job. There are several types of autoscaling tools to choose from when using Kubernetes.&lt;/p&gt;

&lt;h3&gt;Scaling Deployments&lt;/h3&gt;

&lt;h4&gt;Horizontal Pod Autoscaling&lt;/h4&gt;

&lt;p&gt;A Horizontal Pod Autoscaler (HPA) dynamically adjusts the number of Pods in a Deployment to match changing workload demands. When traffic increases, the HPA scales up by deploying more Pods. Conversely, when demand decreases, it scales back down.&lt;/p&gt;
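&lt;p&gt;As a minimal sketch (the Deployment name and numbers below are illustrative, not our production values), a CPU-based HPA can be declared like this:&lt;/p&gt;

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: main-api-hpa            # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: main-api              # illustrative Deployment to scale
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add Pods above 70% average CPU
```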

&lt;h4&gt;Vertical Pod Autoscaling&lt;/h4&gt;

&lt;p&gt;A Vertical Pod Autoscaler (VPA) automatically sets resource requests and limits based on usage patterns. This improves scheduling efficiency, since Kubernetes can place Pods on Nodes that actually have sufficient capacity. VPA can also downscale Pods that are over-requesting resources and upscale those that are under-requesting them.&lt;/p&gt;
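&lt;p&gt;A minimal VPA manifest might look as follows (assuming the VPA custom resources are installed in the cluster; names are illustrative):&lt;/p&gt;

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: main-api-vpa            # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: main-api              # illustrative Deployment
  updatePolicy:
    updateMode: "Auto"          # let VPA apply new requests by recreating Pods
```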

&lt;h4&gt;KEDA&lt;/h4&gt;

&lt;p&gt;For more complex use cases, you can leverage the &lt;a href="https://keda.sh/" rel="noopener noreferrer"&gt;Kubernetes Event Driven Autoscaler (KEDA)&lt;/a&gt; to scale Deployments based on external events. This allows you to scale according to a Cron schedule, database queries (PostgreSQL, MySQL, MSSQL), or items in an event queue (Redis, RabbitMQ, Kafka).&lt;/p&gt;
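&lt;p&gt;For example, a Cron-based KEDA &lt;code&gt;ScaledObject&lt;/code&gt; (names and schedule purely illustrative) could pre-scale a Deployment for a known busy window:&lt;/p&gt;

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: main-api-cron-scaler    # illustrative name
spec:
  scaleTargetRef:
    name: main-api              # illustrative Deployment
  triggers:
    - type: cron
      metadata:
        timezone: Europe/Amsterdam
        start: 0 7 * * *        # scale up before morning rush hour
        end: 0 10 * * *
        desiredReplicas: "10"
```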

&lt;h3&gt;Scaling Nodes&lt;/h3&gt;

&lt;h4&gt;Cluster Autoscaling&lt;/h4&gt;

&lt;p&gt;A Cluster Autoscaler automatically manages Node scaling by adding Nodes when there are unschedulable Pods and removing them when possible.&lt;/p&gt;




&lt;h2&gt;Scaling Our Main API&lt;/h2&gt;

&lt;h3&gt;The Unpredictable Nature of Our Traffic&lt;/h3&gt;

&lt;p&gt;As a shared mobility operator in The Netherlands, our main API's traffic is directly tied to the actual traffic in cities. It's not uncommon for us to see a significant spike in requests during rush hour - we're talking 100K requests per 5 minutes! On the other hand, weekdays at midnight are a different story, with only around 5-10K requests per 5 minutes. And then there are the weekends, which can be highly unpredictable due to weather conditions.&lt;/p&gt;

&lt;p&gt;With such enormous differences in load, manual capacity planning is impossible - especially when you factor in surprise spikes and peak loads. That's where k8s autoscaling comes in, saving our lives (and sanity!) by automatically scaling our resources to match demand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9no93ogg1xs9064wy6zn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9no93ogg1xs9064wy6zn.png" alt="Graph showing API traffic fluctuation in response to Dutch city traffic demand" width="800" height="181"&gt;&lt;/a&gt;&lt;em&gt;Graph showing API traffic fluctuation in response to Dutch city traffic demand&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;Our Use Case: HPA + Cluster Autoscaler&lt;/h3&gt;

&lt;p&gt;For our use case, we found that a Horizontal Pod Autoscaler (HPA) combined with a Cluster Autoscaler was the perfect solution. During rush hour, the HPA scales up our Deployment to meet demand, spinning up more Pods as needed. When there aren't enough resources available on running Nodes, the Cluster Autoscaler kicks in, automatically adding new Nodes to the mix.&lt;/p&gt;

&lt;p&gt;When traffic dies down, the HPA scales back down to a manageable level, after which the Cluster Autoscaler removes unnecessary Nodes. This automated scaling has been a game-changer for us, allowing us to focus on other important tasks while our infrastructure takes care of itself.&lt;/p&gt;

&lt;h2&gt;The Challenge of Unpredictable Deployments&lt;/h2&gt;

&lt;p&gt;As we delved into the world of Kubernetes autoscaling, we encountered a difficult challenge to overcome. Kubernetes' autoscaling tools depend on the retrieval of metrics. For resource metrics, this is the &lt;code&gt;metrics.k8s.io&lt;/code&gt; API, provided by the &lt;a href="https://github.com/kubernetes-sigs/metrics-server" rel="noopener noreferrer"&gt;Kubernetes Metrics Server&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We tried to understand our Deployment's behavior by analyzing its resource usage in Grafana. However, we soon realized that the amount of memory used by each Pod was fluctuating wildly. This unpredictable resource usage made it very difficult to configure our resources correctly for autoscaling.&lt;/p&gt;

&lt;h4&gt;The Eye Opener&lt;/h4&gt;

&lt;p&gt;While developing one of our microservices built in FastAPI, we stumbled upon a crucial piece of documentation that highlighted the importance of handling replication at the cluster level rather than using process managers like Gunicorn in each container.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If you have a cluster of machines with Kubernetes [...] then you will probably want to handle replication at the cluster level instead of using a process manager (like Gunicorn with workers) in each container.”&lt;br&gt;
  &lt;a href="https://fastapi.tiangolo.com/deployment/docker/#replication-number-of-processes" rel="noopener noreferrer"&gt;"Replication - Number of Processes"&lt;/a&gt; (FastAPI documentation)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This was a real eye-opener for us!&lt;/p&gt;

&lt;h4&gt;Gunicorn Workers Causing Confusion&lt;/h4&gt;

&lt;p&gt;Check's main API was originally built in &lt;a href="https://docs.pylonsproject.org/projects/pyramid/en/latest/" rel="noopener noreferrer"&gt;Pyramid&lt;/a&gt;, a Python web framework. Just like Django, Pyramid projects are typically served as a WSGI callable using a WSGI HTTP Server such as Gunicorn. Our legacy configuration had Gunicorn set to use 4 workers at all times.&lt;/p&gt;

&lt;p&gt;On &lt;a href="https://docs.gunicorn.org/en/stable/design.html#how-many-workers" rel="noopener noreferrer"&gt;Gunicorn's documentation page&lt;/a&gt;, they strongly advise running multiple workers, recommending &lt;em&gt;"(2 x $num_cores) + 1 as the number of workers to start off with"&lt;/em&gt; and seemingly incentivizing users to use as many workers as possible.&lt;/p&gt;

&lt;p&gt;As we dug deeper into the issue, we realized that Gunicorn's load balancing across multiple worker processes was confusing the Kubernetes Metrics Server. Because a single Pod had 4 different workers actively processing requests, the resources it used would vary greatly depending on the mix of operations it was handling at the same time.&lt;/p&gt;

&lt;h3&gt;The Solution: A Single Process Per Pod&lt;/h3&gt;

&lt;p&gt;After this revelation, we moved to a single Gunicorn worker per Pod and saw immediate positive results.&lt;/p&gt;

&lt;p&gt;Even though we now had to run close to 4 times as many Pods, we were able to simplify the Deployment's resource configuration, ultimately allowing a single Pod to run with significantly fewer resources too!&lt;/p&gt;
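&lt;p&gt;As a hedged sketch of what such a container spec can look like (the image, module path and resource numbers are illustrative, not our actual configuration):&lt;/p&gt;

```yaml
# Deployment container fragment: exactly one Gunicorn worker per Pod,
# so each Pod's resource usage maps to a single process.
containers:
  - name: main-api
    image: registry.example.com/main-api:latest      # illustrative image
    command: ["gunicorn", "--workers", "1", "--bind", "0.0.0.0:8000", "app:main"]
    resources:
      requests:                  # small, predictable envelope per Pod
        cpu: 250m
        memory: 512Mi
      limits:
        memory: 512Mi
```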

&lt;p&gt;Analyzing the behavior of individual Pods in Grafana after these changes revealed fewer memory spikes, with each Pod staying close to its average resource usage. Most importantly, our HPA started doing its job correctly!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s7sm3ntu0b6351faspl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s7sm3ntu0b6351faspl.png" alt="Graph showing pods spinning up in response to increased demand" width="800" height="279"&gt;&lt;/a&gt;&lt;em&gt;Graph showing Pods spinning up in response to increased demand&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Kubernetes autoscaling can be a complex beast, but with the right approach, it can bring significant benefits to your Deployment. As we navigated the world of Kubernetes autoscaling, we learned some valuable lessons.&lt;/p&gt;

&lt;h3&gt;Analyze and Understand&lt;/h3&gt;

&lt;p&gt;Thorough analysis is key when configuring cluster-based autoscaling with Kubernetes. By understanding your Deployment's resource usage patterns, you can set the right limits for individual Pods and ensure that your cluster autoscaler is working effectively.&lt;/p&gt;

&lt;h3&gt;Avoid Metrics-Server Confusion&lt;/h3&gt;

&lt;p&gt;When using WSGI tools like Gunicorn, be aware of their internal load-balancing features. These can confuse the metrics-server and lead to incorrect scaling decisions. To avoid this, configure your container image so that replication is handled correctly at the cluster level instead.&lt;/p&gt;

&lt;h3&gt;Tailoring Your Solution&lt;/h3&gt;

&lt;p&gt;Most importantly, find the combination of tools and resource configuration that suits your unique deployment needs. We found that an HPA (Horizontal Pod Autoscaler) worked well for our main API Deployment, while a Cron-based autoscaler was a better fit for the Deployment that generates invoices on the first day of the month.&lt;/p&gt;

&lt;h3&gt;The Payoff: Reduced Costs and Improved Peace of Mind&lt;/h3&gt;

&lt;p&gt;By correctly configuring cluster-based autoscaling, we were able to reduce costs and improve peace of mind. Our Deployment now automatically scales according to traffic on our API, eliminating the need for manual server capacity reconfigurations.&lt;/p&gt;

&lt;p&gt;Even though getting to a solid setup isn't easy, it's well worth the time spent. And, as is often the case with technical concepts, you'll improve your feel for configuring these relatively new tools as you use them more. With each new autoscaling setup, you'll gain more confidence in translating Grafana dashboards into HPA configurations, making it easier to configure autoscaling for your future deployments one step at a time.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>autoscaling</category>
      <category>devops</category>
      <category>cloudcomputing</category>
    </item>
    <item>
      <title>How moving from Pandas to Polars made me write better code without writing better code</title>
      <dc:creator>Paul Duvenage</dc:creator>
      <pubDate>Tue, 05 Mar 2024 12:56:44 +0000</pubDate>
      <link>https://dev.to/check/how-moving-from-pandas-to-polars-made-me-write-better-code-without-writing-better-code-52bl</link>
      <guid>https://dev.to/check/how-moving-from-pandas-to-polars-made-me-write-better-code-without-writing-better-code-52bl</guid>
      <description>&lt;p&gt;In a scale-up like &lt;a href="https://ridecheck.app/en" rel="noopener noreferrer"&gt;Check Technologies&lt;/a&gt; data not only grows, but it grows faster too. It was merely a matter of time before our data processes would run into resource limitations. Reason enough to find a more performant solution. Interestingly, the actual result, while impressive, was not the most interesting part of the solution.&lt;/p&gt;

&lt;p&gt;In this article, we will discuss what I like to refer to as the &lt;em&gt;&lt;strong&gt;Polarification&lt;/strong&gt;&lt;/em&gt; of the Check data stack. More specifically: how we used Polars to solve a specific problem and then ended up completely replacing Pandas with Polars on Airflow. &lt;/p&gt;

&lt;p&gt;We also highlight some challenges and learnings anyone can use should they consider the move over to Polars. &lt;/p&gt;

&lt;h2&gt;So what is Polars anyway?&lt;/h2&gt;

&lt;p&gt;Before going down this rabbit hole, it is important to know the basics. If you have been using Python for a while, chances are you have come across Pandas: an open-source dataframe library widely used in the Python ecosystem, most notably in the data engineering/science and analytics world.&lt;/p&gt;

&lt;p&gt;Below is a snippet from Pandas on how to read data from a CSV file into a dataframe.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;While Pandas is a fantastic library that revolutionised the data analytics world, it has several drawbacks. Pandas' original author, Wes McKinney, famously gave a &lt;a href="https://www.slideshare.net/wesm/practical-medium-data-analytics-with-python" rel="noopener noreferrer"&gt;talk&lt;/a&gt; back in 2013 titled &lt;strong&gt;&lt;em&gt;10 Things I Hate About Pandas*&lt;/em&gt;&lt;/strong&gt; where he highlighted the main design changes he would make if he were to rebuild Pandas today.&lt;/p&gt;

&lt;p&gt;These things include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;em&gt;Internals too far from “the metal”&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;No support for memory-mapped datasets&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Poor performance in database and file ingest / export&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Warty missing data support&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Lack of transparency into memory use, RAM management&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Weak support for categorical data&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Complex groupby operations awkward and slow&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Appending data to a DataFrame is tedious and very costly&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Limited, non-extensible type metadata&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Eager evaluation model, no query planning&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;“Slow”, limited multicore algorithms for large datasets&lt;/em&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;*11 things but who is counting...&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;The new kid on the block&lt;/h3&gt;

&lt;p&gt;In comes &lt;a href="https://pola.rs/" rel="noopener noreferrer"&gt;Polars&lt;/a&gt;: a brand-new dataframe library, or as the author Ritchie Vink describes it... &lt;em&gt;a query engine with a dataframe frontend&lt;/em&gt;. Polars is built on top of the &lt;a href="https://arrow.apache.org/" rel="noopener noreferrer"&gt;Arrow&lt;/a&gt; memory format and is written in Rust, a modern, performant and memory-safe systems programming language in the same space as C/C++.&lt;/p&gt;

&lt;p&gt;Below is the Polars equivalent of reading data from a CSV file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;polars&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Polars addresses many of the issues raised by Pandas' author and, in doing so, achieves blazingly fast performance and low memory consumption.&lt;/p&gt;

&lt;p&gt;While Polars boasts many improvements, such as its intuitive design, ease of use, powerful expressions and so much more, my favourite and perhaps its core strength, is its &lt;a href="https://docs.pola.rs/user-guide/concepts/lazy-vs-eager/" rel="noopener noreferrer"&gt;Lazy API&lt;/a&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;polars&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;

&lt;span class="n"&gt;lf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In Pandas, the steps in your code are executed eagerly, meaning they run as-is, in sequential order. Lazy execution, on the other hand, means your code is handed to Polars' &lt;em&gt;query planner&lt;/em&gt; to optimise, and the results are only materialised when you call the &lt;code&gt;collect()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;This Lazy API allows the user to write code and let Polars optimise it. These optimisations result in much faster execution times with less resource usage.&lt;/p&gt;

&lt;p&gt;It is this Lazy API coupled with the power of Airflow where the magic happens.&lt;/p&gt;

&lt;h2&gt;Data Engineering at Check&lt;/h2&gt;

&lt;p&gt;At &lt;a href="https://ridecheck.app/en" rel="noopener noreferrer"&gt;Check Technologies&lt;/a&gt;, we believe strongly in data-informed decision-making. A core part of this is our Check Data Platform. It allows us to perform various analyses, from marketing campaign performance and shift demand forecasting to fleet-health monitoring, fraud detection, zonal &amp;amp; spatial analytics and much more. The insights from these analyses drive us to improve existing features and create new ones.&lt;/p&gt;

&lt;p&gt;A key component of this platform is Airflow, an open-source workflow management platform. Airflow is the industry-standard workflow management tool in the world of data engineering and was chosen because it is widely used, actively developed &amp;amp; maintained, and has a very large community. It also has very good support for established cloud providers, in the form of external packages*:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow-providers/packages-ref.html#apache-airflow-providers-amazon" rel="noopener noreferrer"&gt;&lt;code&gt;apache-airflow-providers-amazon&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow-providers/packages-ref.html#apache-airflow-providers-microsoft-azure" rel="noopener noreferrer"&gt;&lt;code&gt;apache-airflow-providers-microsoft-azure&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow-providers/packages-ref.html#apache-airflow-providers-google" rel="noopener noreferrer"&gt;&lt;code&gt;apache-airflow-providers-google&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;*An extensive list of Airflow Provider packages can be found &lt;a href="https://airflow.apache.org/docs/apache-airflow-providers/packages-ref.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Airflow is essential for our data pipelines and forms the backbone of our data infrastructure.&lt;/p&gt;

&lt;p&gt;Pandas is very well integrated with Airflow. This is clear when looking at the Airflow Postgres integration, where returning a Pandas dataframe from a SQL query is predefined; see below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.providers.postgres.hooks.postgres&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PostgresHook&lt;/span&gt;

&lt;span class="n"&gt;postgres&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PostgresHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postgres_conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_pandas_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;statement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Most of the Directed Acyclic Graphs (DAGs) at Check were built with Pandas. It was used to read data from various sources (databases, s3, etc.), clean the data, and then write the data back to various destinations, most notably our Datalake and Data Warehouse (DWH).&lt;/p&gt;

&lt;h3&gt;The Problem&lt;/h3&gt;

&lt;p&gt;As Check grew, so did the amount of data we were generating. One of our DAGs that processes AppsFlyer data (user attribution &amp;amp; app usage data) grew so large that on busy days we started getting the dreaded &lt;strong&gt;SIGKILL&lt;/strong&gt; (Airflow is out of resources) error. Scaling up our Kubernetes (k8s) cluster to give Airflow more resources worked for a while but we knew this was not a long-term solution. A new approach was needed.&lt;/p&gt;

&lt;p&gt;The AppsFlyer process was made up of 2 separate DAGs, one to parse the raw data and write it to the DWH and a second to take the parsed data and "sessionize" it, meaning grouping all user app interactions within a 15-minute window into unique sessions per user. This enables us to measure various user behaviour metrics and ultimately improve our product through learnings gained from these insights.&lt;/p&gt;

&lt;p&gt;Due to the amount of data, we had to write the parsed data to the DWH and then re-load it in smaller chunks to sessionize it, hence the 2-step process above. Not only did this create a duplication of data, but it also introduced a new problem: inaccurate sessionizing of data.&lt;/p&gt;

&lt;p&gt;A better solution was needed.&lt;/p&gt;

&lt;h3&gt;Pandas works great, why switch to Polars?&lt;/h3&gt;

&lt;p&gt;I had been experimenting with Polars for a bit more than a year (since Polars 0.14.22 back in Oct 2022) when I faced this problem. Being written in Rust, Polars is known for being fast and for having a much smaller memory footprint than Pandas.&lt;/p&gt;

&lt;p&gt;Additionally, the Polars Lazy API, which defers query execution until the last moment and thereby allows optimisations with significant performance advantages, could be the right approach to our problem.&lt;/p&gt;

&lt;p&gt;Thus, when faced with the out-of-memory error, it seemed the perfect opportunity to try Polars in a production environment.&lt;/p&gt;

&lt;p&gt;Before jumping headfirst into installing new dependencies on our production Airflow cluster, I wanted to do some tests: first, to see whether I could even do the necessary data parsing &amp;amp; pre-processing in Polars, and second, to determine what benefits we could gain.&lt;/p&gt;

&lt;h3&gt;The Experiment&lt;/h3&gt;

&lt;p&gt;The experiment was simple: take a single hour of the data and compare the eager, original Pandas solution with the new Lazy-API-powered Polars solution, measuring dataframe size and time taken. Yes, this is crude, and no, it is not scientifically sound, but it was merely meant to confirm the rumours about Polars.&lt;/p&gt;

&lt;p&gt;The data, in the form of partitioned parquet files, consists of 103 columns, with one of these columns, &lt;code&gt;event_value&lt;/code&gt;, containing JSON data in string format. It is this &lt;code&gt;event_value&lt;/code&gt; column that contains most of the important data we need.&lt;/p&gt;

&lt;p&gt;Unfortunately, the JSON in this column is not uniform and can also be null. Below is a snippet from this column.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfqvfnlki207arz6jmno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhfqvfnlki207arz6jmno.png" alt="Source Data"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will not do a detailed code comparison between Pandas &amp;amp; Polars in this blog; however, I do want to highlight the differences between the eager and lazy approaches using small snippets from the original solutions.&lt;/p&gt;

&lt;p&gt;A detailed comparison between Pandas and Polars solutions along with additional benefits that Polars offer will follow in the 2nd blog in this series.&lt;/p&gt;

&lt;h4&gt;The Pandas way&lt;/h4&gt;

&lt;p&gt;In Pandas we can parse this JSON to a dataframe using the &lt;code&gt;json_normalize()&lt;/code&gt; function and then concatenate it to the original dataframe and continue the data transformation.&lt;/p&gt;

&lt;p&gt;We first need to parse the string data to valid JSON using the &lt;code&gt;json.loads()&lt;/code&gt; function, which, unfortunately, does not take a Series as input:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;JSON&lt;/span&gt; &lt;span class="nb"&gt;object&lt;/span&gt; &lt;span class="n"&gt;must&lt;/span&gt; &lt;span class="n"&gt;be&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;bytes&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="nb"&gt;bytearray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;Series&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Therefore, we have to use a lambda function to apply the string-to-JSON conversion to each row in the Series, after which we can convert the JSON to a dataframe.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df_normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json_normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df_events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt; &lt;span class="n"&gt;df_normalized&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_events&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;keep_columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COLUMNS_INAPPS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="n"&gt;df_final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_events&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;keep_columns&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
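&lt;p&gt;To make the flow above concrete, here is a minimal, self-contained sketch of the same Pandas steps on toy data (the column and field names are illustrative, not the real schema):&lt;/p&gt;

```python
import json

import pandas as pd

# Toy events: one row with a JSON payload, one with an empty payload.
df = pd.DataFrame({
    "event_name": ["location_update", "app_open"],
    "event_value": ['{"latitude": 52.37, "longitude": 4.89}', ""],
})

# Convert each JSON string to a dict (empty strings become empty dicts).
df["event_value"] = df["event_value"].apply(
    lambda row: json.loads(row) if row else {}
)

# Flatten the dicts into columns and attach them to the original frame.
df_normalized = pd.json_normalize(df["event_value"].tolist())
df_events = pd.concat([df[["event_name"]], df_normalized], axis=1)
```

&lt;p&gt;After this, &lt;code&gt;df_events&lt;/code&gt; has one column per JSON key, with &lt;code&gt;NaN&lt;/code&gt; where the payload was empty.&lt;/p&gt;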
&lt;h4&gt;
  
  
  The Polars way
&lt;/h4&gt;

&lt;p&gt;How does this solution compare to the Polars Lazy approach?&lt;/p&gt;

&lt;p&gt;The first thing to know is that Polars != Pandas, so solving the same problem is not as simple as changing the imports from Pandas to Polars. It requires a new way of thinking, one which I would argue is much simpler and more intuitive.&lt;/p&gt;

&lt;p&gt;The Polars solution below is only a small part of the original solution. The sections of this solution that are not important to this comparison have been replaced with &lt;code&gt;...&lt;/code&gt;:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="n"&gt;lazy_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scan_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;paths&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;lazy_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_columns&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="bp"&gt;...&lt;/span&gt;
            &lt;span class="bp"&gt;...&lt;/span&gt;
            &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json_path_match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$.last_location_timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strptime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;%Y-%m-%dT%H:%M:%S%Z&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Datetime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location_last_updated_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json_path_match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$.latitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;latitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;event_value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json_path_match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;$.longitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strict&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;longitude&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="bp"&gt;...&lt;/span&gt;
            &lt;span class="bp"&gt;...&lt;/span&gt;
            &lt;span class="bp"&gt;...&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COLUMNS_INAPPS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;collect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Glossing over a ton of detail, the main things to note are the use of the Polars Lazy API and the difference in syntax. &lt;/p&gt;

&lt;p&gt;In step one, we lazily load the dataset using the &lt;code&gt;scan_parquet()&lt;/code&gt; function. An important detail is that the &lt;code&gt;scan_parquet()&lt;/code&gt; input path can contain wildcards (&lt;code&gt;*&lt;/code&gt;), meaning you can lazily load multiple files at once.&lt;/p&gt;

&lt;p&gt;We then define all the transformations we want to apply to this LazyFrame using Polars expressions, and finally we call the &lt;code&gt;collect()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;Polars expressions are also very intuitive and result in clear, readable code. Here we use the string function &lt;code&gt;json_path_match()&lt;/code&gt; to extract the JSON value we want, parse and cast it to a datetime, and finally assign the result to a new, named column using the &lt;code&gt;alias()&lt;/code&gt; method.&lt;/p&gt;
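&lt;p&gt;Conceptually, each expression chain does what the following plain-Python sketch does for a single event (the field names here are illustrative):&lt;/p&gt;

```python
import json
from datetime import datetime

# One raw event payload, as it might appear in the event_value column.
event_value = (
    '{"last_location_timestamp": "2024-01-15T09:30:00+00:00",'
    ' "latitude": "52.37", "longitude": "4.89"}'
)

parsed = json.loads(event_value)

# json_path_match + strptime, in plain Python:
location_last_updated_at = datetime.fromisoformat(parsed["last_location_timestamp"])
latitude = float(parsed["latitude"])    # the equivalent of .cast(pl.Float64)
longitude = float(parsed["longitude"])
```

&lt;p&gt;The difference is that Polars runs this work column-wise, in parallel, and in Rust, rather than row by row in Python.&lt;/p&gt;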

&lt;p&gt;Once we call the &lt;code&gt;collect()&lt;/code&gt; method, all our transformations are passed to the Polars query planner, which optimises them and materialises the results into a dataframe.&lt;/p&gt;
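&lt;p&gt;If you have not worked with lazy evaluation before, Python generators are a loose analogy: transformations are only described until something finally consumes them, just as nothing runs in Polars until &lt;code&gt;collect()&lt;/code&gt; is called:&lt;/p&gt;

```python
from itertools import islice

def scan(rows):
    # Lazily yield rows; nothing is read or computed yet.
    yield from rows

rows = scan(range(1_000_000))
doubled = (r * 2 for r in rows)   # transformation: described, not executed
first_five = islice(doubled, 5)   # still nothing has run

# The "collect()" moment: only now is work done, and only five rows of it.
result = list(first_five)
```

&lt;p&gt;Polars goes further than this analogy: its query planner can reorder and fuse the described steps before executing them.&lt;/p&gt;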

&lt;h3&gt;
  
  
  The Experiment Results
&lt;/h3&gt;

&lt;p&gt;Both solutions were repeatedly tested against various batches of the same data to ensure accurate results. Jupyter Notebook's &lt;code&gt;%%time&lt;/code&gt; magic command was used to measure execution time. For Pandas, &lt;code&gt;memory_usage()&lt;/code&gt; was used to measure size; for Polars, the &lt;code&gt;estimated_size()&lt;/code&gt; function was used.&lt;/p&gt;
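&lt;p&gt;Outside a notebook, where the &lt;code&gt;%%time&lt;/code&gt; magic is unavailable, a comparable timing measurement could be taken with the standard library (the workload below is just a stand-in for the dataframe pipeline):&lt;/p&gt;

```python
import time

start = time.perf_counter()
total = sum(i * i for i in range(100_000))  # stand-in workload
elapsed = time.perf_counter() - start

print(f"took {elapsed:.4f}s")
```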

&lt;p&gt;Below are the results from that test:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qeloaynhfmkpqtg4ox0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5qeloaynhfmkpqtg4ox0.png"&gt;&lt;/a&gt;&lt;br&gt;Pandas Time &amp;amp; Memory usage measurements
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrd8tu89wwrnekgonooa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrd8tu89wwrnekgonooa.png"&gt;&lt;/a&gt;&lt;br&gt;Polars Time &amp;amp; Memory usage measurements
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy6y9iesm51j90e0q9wk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy6y9iesm51j90e0q9wk.png"&gt;&lt;/a&gt;&lt;br&gt;Pandas vs Polars: Execution Time
  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9jgz9ltndg2qy8o6ajn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9jgz9ltndg2qy8o6ajn.png"&gt;&lt;/a&gt;&lt;br&gt;Pandas vs Polars: Memory Usage
  &lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;3.3x&lt;/strong&gt; speed and approximately &lt;strong&gt;3.1x&lt;/strong&gt; memory improvement. Quite a big change, and more than I was expecting. It confirmed the rumours about Polars. But why such a big change?&lt;/p&gt;

&lt;p&gt;Well, it turns out that not only is Polars fast and low on resource usage, but it also helps you as a developer write better code. How does it do that? Simple: by not requiring you (the developer) to write good code.&lt;/p&gt;

&lt;p&gt;What do I mean by this ridiculous statement? &lt;/p&gt;

&lt;p&gt;In Pandas, to write the most optimised code you need to think of every optimisation yourself: column selection, order of operations, materialisations, and vectorisation. This can be complex, and it is easy to get wrong. Polars, on the other hand, lets you focus on solving the business problem while it handles the optimisations. That is the power of the Lazy API.&lt;/p&gt;
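&lt;p&gt;Vectorisation is a typical example of an optimisation Pandas leaves to you. Both versions below produce the same result, but only one of them is fast, and nothing stops you from writing the other:&lt;/p&gt;

```python
import pandas as pd

s = pd.Series(range(10_000))

# Easy to write, slow: a Python-level loop over every element.
slow = [v * 2 for v in s]

# The vectorised version you must remember to write yourself.
fast = s * 2
```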

&lt;p&gt;Running &lt;code&gt;.show_graph()&lt;/code&gt; on the LazyFrame shows a plot of the query plan.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cetfsa9nbuvxx3f9l6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cetfsa9nbuvxx3f9l6u.png"&gt;&lt;/a&gt;&lt;br&gt;Polars Query Plan
  &lt;/p&gt;

&lt;p&gt;In it, we see the below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fup6b825kxvkjv7g0e9dh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fup6b825kxvkjv7g0e9dh.png"&gt;&lt;/a&gt;&lt;br&gt;Column Selection Filter at Scan Level
  &lt;/p&gt;

&lt;p&gt;which means Polars automatically pushed the column selection down to the scan and only loaded the 19 columns we needed. Not reading the remaining 84 columns greatly improves performance and reduces memory overhead.&lt;/p&gt;

&lt;p&gt;While this optimisation (known as &lt;strong&gt;&lt;em&gt;Projection Pushdown&lt;/em&gt;&lt;/strong&gt;) is also possible in Pandas, it is often overlooked. It's possible to write fast code in Pandas, but in Polars with the Lazy API, fast is the default.&lt;/p&gt;
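&lt;p&gt;In Pandas the equivalent of this pushdown is manual: you have to tell the reader function up front which columns to materialise. A minimal sketch, using an in-memory CSV so it is self-contained:&lt;/p&gt;

```python
import io

import pandas as pd

csv_text = "event_name,latitude,unused_col\napp_open,52.37,ignore_me\n"

# Manual projection: only the listed columns are materialised.
df = pd.read_csv(io.StringIO(csv_text), usecols=["event_name", "latitude"])
```

&lt;p&gt;&lt;code&gt;pandas.read_parquet()&lt;/code&gt; accepts a &lt;code&gt;columns&lt;/code&gt; argument for the same purpose; the point is that in Pandas you must remember to pass it.&lt;/p&gt;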

&lt;p&gt;From this test, it is clear that Polars offers substantial performance and memory efficiency gains. Armed with these results I decided to migrate the AppsFlyer DAG from Pandas to Polars in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  In Comes Airflow
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Migrating 1 DAG&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Having already developed the Polars Lazy API solution for the test above, the AppsFlyer DAG migration was mostly done. Only minor refactors to logging &amp;amp; notifications were required.&lt;/p&gt;

&lt;p&gt;The new and improved DAG ran in production for a week while I closely monitored its performance. Not only did it work flawlessly, but we also saw a massive speed improvement, fairly in line with the experiment results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0itgk9qhk2hub40a3ezv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0itgk9qhk2hub40a3ezv.png"&gt;&lt;/a&gt;&lt;br&gt;AppsFlyer DAG post Polars migration
  &lt;/p&gt;

&lt;p&gt;With this positive outcome, I decided to migrate all remaining DAGs.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Migrating 100+ DAGs&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
At Check, the Airflow DAGs can be grouped into 3 categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Extract and Load (EL, T is done in the DWH)&lt;/li&gt;
&lt;li&gt;Complex Data Ingestion, Transformations or Parsing &lt;/li&gt;
&lt;li&gt;Other (Spatial Computation, Archival, Reporting, etc)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The majority of the DAGs fall into the EL group and share similar logic. They all extract data from a database and write it to our s3 datalake using the parquet file format. From there, the data is either loaded into our DWH or used by another downstream process. The remaining DAGs all have unique data sources but still write to the same s3 datalake.&lt;/p&gt;

&lt;p&gt;Our initial Airflow setup abstracted away this shared logic into helper functions that can be re-used by all DAGs.&lt;/p&gt;

&lt;p&gt;Below are the Pandas implementations of the &lt;strong&gt;Read&lt;/strong&gt; &lt;code&gt;get_df_for_psql()&lt;/code&gt; &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_df_for_psql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stmt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Setting up Postgres hook&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;postgres&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PostgresHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postgres_conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_pandas_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stmt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dataframe&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;and &lt;strong&gt;Write&lt;/strong&gt; &lt;code&gt;wrangle_to_s3()&lt;/code&gt;  helper functions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wrangle_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;s3_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;LAKE_BUCKET&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/datalake/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;wr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s3_location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;sanitize_columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;pyarrow_additional_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;coerce_timestamps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;allow_truncated_timestamps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_deprecated_int96_timestamps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s3_location&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first step in the migration was to refactor these helper functions to their Polars equivalents. This was very straightforward; see below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_df_for_psql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stmt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Setting up Postgres hook&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;postgres&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PostgresHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;postgres_conn_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgres&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;postgres&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_uri&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;dataframe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_database_uri&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;stmt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dataframe&lt;/span&gt;



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wrangle_to_s3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;s3_location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;LAKE_BUCKET&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/datalake/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;fs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;S3FileSystem&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; 
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;dataframe&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_parquet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s3_location&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To my surprise, for some DAGs it really was this simple: change the top-level imports, refactor two small functions, and you have migrated from Pandas to Polars.&lt;/p&gt;

&lt;p&gt;Other DAGs required a bit more work. Most of the migration refactoring was centred on rewriting transformation steps as Polars expressions. While this took some time to get used to, the resulting code was much more readable and maintainable.&lt;/p&gt;

&lt;p&gt;The migration did, however, present a couple of challenges. Testing these changes locally was essential to ensure no downtime in services.&lt;/p&gt;

&lt;p&gt;Using an M1 MacBook means developing locally on Airflow requires &lt;code&gt;aarch64&lt;/code&gt; support. While Polars itself is well supported on all platforms, &lt;code&gt;connector-x&lt;/code&gt;, a dependency for reading data from a database, still does not have pre-built wheels for &lt;code&gt;aarch64&lt;/code&gt; at the time of writing this blog.&lt;/p&gt;

&lt;p&gt;This was originally a blocker; however, we managed to set up a multi-stage Docker build that builds it from source. Here is the &lt;a href="https://github.com/sfu-db/connector-x/issues/386" rel="noopener noreferrer"&gt;GitHub issue&lt;/a&gt; where we, along with community members, solved it.&lt;/p&gt;

&lt;p&gt;Being able to test Polars locally on Airflow gave us the confidence to use it in production. For us, the benefits of Polars were worth the additional effort of building a beta dependency from source, which was only needed for local testing.&lt;/p&gt;

&lt;p&gt;Polars does support multiple database connection drivers (&lt;code&gt;ADBC&lt;/code&gt;, &lt;code&gt;ODBC&lt;/code&gt;, &lt;code&gt;SQLAlchemy&lt;/code&gt;, etc.); however, connector-x is noticeably faster. It is worth pointing out that there is an open &lt;a href="https://github.com/sfu-db/connector-x/pull/545" rel="noopener noreferrer"&gt;PR&lt;/a&gt; to add &lt;code&gt;aarch64&lt;/code&gt; support to &lt;code&gt;connector-x&lt;/code&gt;, which I expect to fix this issue any day now. In addition, having spoken to the author and core maintainers of Polars, I know big changes are coming, especially once the &lt;code&gt;ADBC&lt;/code&gt; project reaches maturity.&lt;/p&gt;

&lt;p&gt;Secondly, while developing the original Lazy API solution, I encountered an error that I could not resolve. When I sought help in the Polars Discord channel, the author suggested a fix, which worked, and asked me to log a GitHub issue. Having logged the issue, I was amazed to see it resolved within a matter of hours and shipped in the next release.&lt;/p&gt;

&lt;p&gt;Polars is a young library in very active development, with frequent release cycles (often 1-2 per week). Additionally, the maintainers are super responsive and helpful, so any issues you run into are quickly resolved.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;Post-migration, we observed that almost all DAGs ran roughly 2x faster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofjecxsvaz7ofd8mrspp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofjecxsvaz7ofd8mrspp.png"&gt;&lt;/a&gt;&lt;br&gt;DAG execution time reduced after Polars release
  &lt;/p&gt;

&lt;p&gt;The original goal, fixing the out-of-memory error, was not only realised; we were also able to combine two separate processes into one, thereby simplifying the pipeline, avoiding data duplication and improving the sessionizing accuracy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2u0zcu8dxsxvaz0cb21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2u0zcu8dxsxvaz0cb21.png"&gt;&lt;/a&gt;&lt;br&gt;2 Processes combined using Polars LazyAPI
  &lt;/p&gt;

&lt;p&gt;We also observed a healthy drop in overall resource usage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cjx2dapyanq9e9adyt0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3cjx2dapyanq9e9adyt0.png"&gt;&lt;/a&gt;&lt;br&gt;Reduction in overall resource utilisation after Polars migration
  &lt;/p&gt;

&lt;p&gt;This allowed us to scale down our services (rather than constantly scaling up) and resulted in a much more stable platform. In addition, the reduced resource footprint cut the cloud provider bill for our data stack by an impressive 25%.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;Finally, the &lt;strong&gt;Polarification&lt;/strong&gt; of Check's data stack is complete. With over 100 DAGs currently running in production, the migration took less than two weeks (one sprint), with no disruption to normal operations.&lt;/p&gt;

&lt;p&gt;Having migrated all the DAGs from Pandas to Polars and observed the benefits, it is clear that it was the right decision. Not only did we see a reduction in resource usage, but many DAGs also ran faster. This shortens the time from raw data to insights and makes the Check Data Platform more agile.&lt;/p&gt;

&lt;p&gt;Polars does the heavy lifting for you. You can focus on solving the problem at hand, and Polars will take care of the optimisations.&lt;/p&gt;

&lt;p&gt;Thanks to these optimisations, and to being written in Rust, Polars uses fewer resources; your cloud infrastructure and your wallet will love you.&lt;/p&gt;

&lt;p&gt;While migrating from Pandas to Polars wasn't without its challenges, it was surprisingly easy. As Polars matures, I would not be surprised to see it supported natively in provider packages, just as Pandas is today. This would most certainly benefit the whole data engineering industry.&lt;/p&gt;

&lt;p&gt;The responsiveness of the maintainers and the supportive community added to our decision to migrate.&lt;/p&gt;

&lt;p&gt;For us, the results speak for themselves. Polars not only solved our initial problem but opened the door to new possibilities. We are excited to use Polars on future data engineering projects. &lt;/p&gt;

&lt;p&gt;Anyone not yet considering Polars can hopefully learn from our experience and take the leap.&lt;/p&gt;

</description>
      <category>polars</category>
      <category>dataengineering</category>
      <category>rust</category>
      <category>airflow</category>
    </item>
    <item>
      <title>How having one million API requests an hour pointed us into building a Rust microservice that processes fleet updates</title>
      <dc:creator>Jordi Been</dc:creator>
      <pubDate>Tue, 30 Jan 2024 11:05:43 +0000</pubDate>
      <link>https://dev.to/check/how-having-one-million-api-requests-an-hour-pointed-us-into-building-a-rust-microservice-that-processes-fleet-updates-2d64</link>
      <guid>https://dev.to/check/how-having-one-million-api-requests-an-hour-pointed-us-into-building-a-rust-microservice-that-processes-fleet-updates-2d64</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;EVERYONE IN THE CITY, EVERYWHERE IN 15 MINUTES.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's our motto at &lt;a href="https://ridecheck.app/en" rel="noopener noreferrer"&gt;Check Technologies&lt;/a&gt;, a shared mobility operator in the Netherlands where users can rent e-mopeds, e-kickscooters or e-cars. When the company was founded, Check decided to hire a team of engineers to build a custom platform, as opposed to using an off-the-shelf SaaS product. &lt;/p&gt;

&lt;p&gt;This team of just six engineers is responsible not only for building, maintaining and improving the Check application used by over 800K users today, but also for building internal tooling, performing data analyses and hosting the platform.&lt;/p&gt;

&lt;p&gt;From launching back in February 2020 until now, the company has seen significant growth in users, trips and vehicles.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Number of Vehicles&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1 Jan 2021&lt;/td&gt;
&lt;td&gt;1170&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 Jan 2022&lt;/td&gt;
&lt;td&gt;3146&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1 Jan 2023&lt;/td&gt;
&lt;td&gt;8160&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With this new blog, we (Check's engineering team) would like to share some of the technical challenges we had to overcome, the solutions we came up with, and the insights we've gained along the way. Expect write-ups from different engineers within the team who will share their thoughts on topics related to their domain, such as app development, cloud infrastructure and data engineering.&lt;/p&gt;

&lt;p&gt;First up: &lt;strong&gt;How having one million API requests an hour pointed us into building a Rust microservice that processes fleet updates&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  A microservice, why?
&lt;/h2&gt;

&lt;p&gt;Up until the start of 2022, the Check backend was hosted as an &lt;a href="https://aws.amazon.com/elasticbeanstalk/" rel="noopener noreferrer"&gt;Elastic Beanstalk&lt;/a&gt; web application. Even though this AWS service proved reliable for getting us off the ground initially, we had run into its limits multiple times. Getting the autoscaling configuration right was rough, costs were growing month after month, and, most importantly, Elastic Beanstalk is not made for hosting a microservice architecture.&lt;/p&gt;

&lt;p&gt;By making the move to Kubernetes starting that year, we paved the way for building smaller applications that can run (and scale) independently. &lt;em&gt;Microservices&lt;/em&gt;, as you'd call them. &lt;/p&gt;




&lt;h2&gt;
  
  
  Webhooks
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Over 1 million requests every hour&lt;/strong&gt;&lt;br&gt;
It all started years back, during a moment of celebration. We had reached the impressive number of &lt;strong&gt;1 million requests an hour&lt;/strong&gt;: a moment worth cheering, yet also a moment in which we discovered something remarkable. We analysed the distribution of these requests and concluded that over 60% of them were &lt;em&gt;webhooks&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Webhooks&lt;/strong&gt;&lt;br&gt;
At Check, users can rent different types of vehicles in our app. At the time of this project, we had integrated the mopeds &lt;a href="https://nz.e-scooter.co/niu-n1s/" rel="noopener noreferrer"&gt;NIU N1S&lt;/a&gt; and &lt;a href="https://shop.segway.com/nl_nl/segway-escooter-e110s.html" rel="noopener noreferrer"&gt;Segway E110S&lt;/a&gt;, as well as the kickscooter &lt;a href="https://www.segway.com/ninebot-kickscooter-max/" rel="noopener noreferrer"&gt;Segway Ninebot MAX&lt;/a&gt;. Both providers have developed APIs for executing commands on their vehicles (turning them on and off) and for receiving information about them (location, mileage, battery percentage). Our backend exposed an API route that these providers used to POST vehicle information to, in the form of a webhook. &lt;/p&gt;

&lt;p&gt;For a moped's location, this API route processed the update as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Receive the update (e.g. &lt;em&gt;moped [x] is now at [coordinates]&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Store raw location and time in the database&lt;/li&gt;
&lt;li&gt;Update the corresponding vehicle's location in the database&lt;/li&gt;
&lt;li&gt;Send a successful response&lt;/li&gt;
&lt;/ul&gt;
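&lt;p&gt;The steps above can be sketched in a few lines of (hypothetical, simplified) Python, with in-memory dictionaries standing in for the two database writes:&lt;/p&gt;

```python
import time

# In-memory stand-ins for the two database writes; purely illustrative.
raw_locations = []                                  # raw location history table
vehicles = {"moped-1": {"lat": 0.0, "lon": 0.0}}    # current vehicle state table

def handle_webhook(payload: dict) -> int:
    """Process one provider webhook the way the original endpoint did."""
    # 1. Receive the update (moped [x] is now at [coordinates])
    vehicle_id, lat, lon = payload["id"], payload["lat"], payload["lon"]
    # 2. Store the raw location and time
    raw_locations.append({"id": vehicle_id, "lat": lat, "lon": lon, "ts": time.time()})
    # 3. Update the corresponding vehicle's current location
    vehicles[vehicle_id]["lat"] = lat
    vehicles[vehicle_id]["lon"] = lon
    # 4. Send a successful response
    return 200

status = handle_webhook({"id": "moped-1", "lat": 52.37, "lon": 4.90})
```

&lt;p&gt;Both database round-trips sit on the request path here, which is exactly where the 250ms went.&lt;/p&gt;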

&lt;p&gt;Even though this request was kept as small as possible, it still took our backend around &lt;strong&gt;250ms&lt;/strong&gt; to process. We had around 5,000 vehicles back then, each sending an update every 5 seconds when turned on, so our backend spent quite some time processing these updates, all while having to serve app users' requests as well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third-party bombs&lt;/strong&gt;&lt;br&gt;
Our platform heavily depends on this integration for processing a provider's constant stream of vehicle updates. Even though the integration worked flawlessly most of the time, every once in a while one of the providers would have a small hiccup on their side. These hiccups not only meant we received no vehicle updates for a few minutes; they also meant we were about to receive something we internally referred to as 'a bomb': a big batch of vehicle updates containing everything that happened during the hiccup. In short: &lt;strong&gt;we would sometimes receive half an hour's worth of vehicle updates within a few seconds&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Depending on how big they were, these 'bombs' were notorious for causing instability within our platform. Our backend was unable to process both the user traffic and all these vehicle updates at the same time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Off to better things&lt;/strong&gt;&lt;br&gt;
Longing for a situation where user traffic and fleet update traffic would no longer be processed by the same service, and given that over 60% of incoming traffic during peak hours consisted of webhooks, we decided this was the perfect chance to put our new Kubernetes infrastructure to the test. And so we started building our first microservice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Rust
&lt;/h2&gt;

&lt;p&gt;Due to the sheer volume and relative simplicity of these webhook requests, with a clear input (the webhook) and output (a 200 OK status code), we decided to build a proof of concept using Rust. Rust is a low-level language, primarily known for being strongly typed, memory safe and highly performant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stack&lt;/strong&gt;&lt;br&gt;
The proof-of-concept was built using the following Cargo crates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;rocket&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;serde&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tokio&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;postgres&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;redis&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project compiles into two separate binaries, one for the API service, and one for the consumer service.&lt;/p&gt;
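&lt;p&gt;In Cargo terms, that is a single package with two &lt;code&gt;[[bin]]&lt;/code&gt; targets, one per service. The names and paths below are hypothetical, not Check's actual manifest:&lt;/p&gt;

```toml
# Cargo.toml: one crate, two binaries (names illustrative)
[package]
name = "fleet"
version = "0.1.0"
edition = "2021"

[[bin]]
name = "fleet-api"        # the webhook API service
path = "src/bin/api.rs"

[[bin]]
name = "fleet-consumer"   # the queue consumer service
path = "src/bin/consumer.rs"
```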

&lt;p&gt;&lt;strong&gt;Fleet webhook API service&lt;/strong&gt;&lt;br&gt;
The first component of our Rust microservice is the &lt;em&gt;'Fleet webhook API'&lt;/em&gt;. This service exposes a &lt;a href="https://rocket.rs/" rel="noopener noreferrer"&gt;rocket&lt;/a&gt; API layer with an endpoint for each provider to send their vehicle updates to. &lt;/p&gt;

&lt;p&gt;Once this service receives a webhook, it inserts the raw body into a &lt;a href="https://redis.io/" rel="noopener noreferrer"&gt;Redis&lt;/a&gt; queue and immediately responds with a '200 OK'. By not having to read from or write to the database during the request, we reduced the response time by more than 10x. These little requests now take at most &lt;strong&gt;25ms&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fleet consumer service&lt;/strong&gt;&lt;br&gt;
The second component of our Rust microservice is the '&lt;em&gt;Fleet consumer&lt;/em&gt;'. This binary is connected to the same Redis queue, and is responsible for actually processing the updates.&lt;/p&gt;

&lt;p&gt;It updates the moped in the application's database and stores a raw entry of it to a &lt;a href="https://www.timescale.com/" rel="noopener noreferrer"&gt;TimeScale&lt;/a&gt; database (a PostgreSQL database specifically designed to handle large sets of event data).&lt;/p&gt;
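&lt;p&gt;The shape of the two components can be sketched in a few lines; this is illustrative Python rather than the actual Rust code, with an in-process queue and dictionaries standing in for Redis, the application database and TimeScale:&lt;/p&gt;

```python
import json
import queue

# In-process stand-ins for Redis, the app DB and TimeScale; illustrative only.
fleet_queue = queue.Queue()
vehicles = {"moped-1": {"battery": 100}}
timescale_rows = []

def webhook_api(raw_body: str) -> int:
    """'Fleet webhook API': enqueue the raw body and answer immediately."""
    fleet_queue.put(raw_body)   # no database work on the request path
    return 200

def consume_one() -> None:
    """'Fleet consumer': pop one update and do the actual processing."""
    update = json.loads(fleet_queue.get())
    vehicles[update["id"]]["battery"] = update["battery"]  # update app DB
    timescale_rows.append(update)                          # store raw event

status = webhook_api(json.dumps({"id": "moped-1", "battery": 87}))
consume_one()
```

&lt;p&gt;The queue in the middle is what absorbs a 'bomb': the API keeps answering in milliseconds while the consumers drain the backlog at their own pace.&lt;/p&gt;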

&lt;p&gt;&lt;strong&gt;Separate binaries&lt;/strong&gt;&lt;br&gt;
The great thing about this set up is that we're able to independently scale both of these components. Because the consumers that process the updates are doing most of the heavy lifting, we usually run around three times as many Kubernetes Pods of them, as opposed to the webhook API.&lt;/p&gt;




&lt;h2&gt;
  
  
  New situation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Dealing with bombs&lt;/strong&gt;&lt;br&gt;
This now means that user traffic, as well as back-office traffic, is handled independently from fleet update traffic. When a third party has a hiccup, resulting in loads of fleet updates to process at once, our users will not experience any latency in their apps because even though the microservice will be busy processing these updates, the main API is still sailing smoothly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extensibility&lt;/strong&gt;&lt;br&gt;
This microservice was built prior to when Check released e-cars on its platform. However, when integrating e-cars into the platform using &lt;a href="https://invers.com/en/solutions/cloudboxx/" rel="noopener noreferrer"&gt;Invers' Cloudboxx&lt;/a&gt;, we were able to swiftly implement their AMQP functionality to process live information about our cars, proving the extensibility of the service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Independent scaling&lt;/strong&gt;&lt;br&gt;
We're able to scale our main API independently from this fleet update microservice. With 60% of our traffic being fleet updates, we were able to significantly downscale our main API. Additionally, Rust's focus on performance and minimal resource usage allowed us to reduce costs in the meantime.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8kfrfeqkv4tlse5739s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff8kfrfeqkv4tlse5739s.png" alt="Request distribution impact" width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;At Check Technologies, our engineering efforts go beyond just adding new features for our users; we actively strive to enhance efficiency, scalability and resilience of the platform. By transitioning to Kubernetes and developing our first microservice in Rust, we were able to overcome the challenges associated with a high volume of webhook requests, ensuring a smooth experience for our users.&lt;/p&gt;

&lt;p&gt;The adoption of a microservices architecture, in combination with our first Rust-based solution, has revolutionised the way we process fleet updates. The 'Fleet webhook API' and 'Fleet consumer' services, operating as independent components, enable independent scaling, reducing latency and enhancing overall system stability. We have effectively mitigated the impact of 'third-party bombs', allowing our main API to sail smoothly even during peak traffic hours.&lt;/p&gt;

&lt;p&gt;As we look back on the progress achieved in our technology stack, we are enthusiastic about the opportunities that lie ahead. At Check Technologies, we are dedicated to raising the bar, finding innovative solutions, and ensuring that our platform continues to be at the forefront of shared mobility technology. The journey has been challenging, but the success of our fleet microservice marks a high note, laying the foundation for sustained growth and ongoing technological advancements in the field of shared mobility.&lt;/p&gt;

</description>
      <category>sharedmobility</category>
      <category>rust</category>
      <category>kubernetes</category>
      <category>microservices</category>
    </item>
  </channel>
</rss>
