Cédric Fabianski for Bearer

Originally published at blog.bearer.sh

How Rust Lets Us Monitor 30k API calls/min

At Bearer, we are a polyglot engineering team, both in spoken languages and programming languages. Our stack is made up of services written in Node.js, Ruby, Elixir, and a handful of others, in addition to all the languages our agent library supports. Like most teams, we balance using the right tool for the job with using the right tool for the time. Recently, we hit a limitation in one of our services that led us to move that service from Node.js to Rust. This post goes into some of the details behind the need to change languages, as well as some of the decisions we made along the way.

A bit of context

We are building a solution to help developers monitor their APIs. Every time a customer’s application calls an API, a log gets sent to us, where we monitor and analyze it.

At the time of the issue, we were processing an average of 30k API calls per minute. That's a lot of API calls made across all our customers. We split the process into two key parts: log ingestion and log processing.

Original architecture with Node.js

We originally built the ingestion service in Node.js. It would receive the logs, communicate with an Elixir service to check customer access rights, check rate limits using Redis, and then send the log to CloudWatch. There, it would trigger an event to tell our processing worker to take over.

We capture information about the API call, including the payloads (both the request and response) of every call sent from a user's application. These are currently limited to 1MB, but that is still a large amount of data to process. We send and process everything asynchronously and the goal is to make the information available to the end-user as fast as possible.

We hosted everything on AWS Fargate, a serverless management solution for Elastic Container Service (ECS), and set it to autoscale after 4000 req/min. Everything was great! Then, the invoice came 😱.

AWS invoices CloudWatch based on storage: the more you store, the more you pay.

Fortunately, we had a backup plan.

Kinesis to the rescue?

Instead of sending the logs to CloudWatch, we would use Kinesis Firehose. Kinesis Firehose is basically a Kafka equivalent provided by AWS. It allows us to deliver a data stream in a reliable way to several destinations. With very few updates to our log processing worker, we were able to ingest logs from both CloudWatch and Kinesis Firehose. With this change, daily costs would drop to about 0.6% of what they were before.

Architecture after adding Kinesis

The updated service now passed the log data through Kinesis and into S3, which triggered the worker to take over the processing task. We rolled the change out and everything was back to normal... or so we thought. Soon after, we started to notice some anomalies on our monitoring dashboard.

We were garbage collecting, a lot. Garbage collection (GC) is a way for some languages to automatically free up memory that is no longer in use. When it runs, the program pauses. This is known as a GC pause. The more writes you make to memory, the more garbage collection needs to happen and, as a result, the longer the pause times become. For our service, these pauses grew long enough to cause the servers to restart and put stress on the CPU. When this happens, it can look like the server is down (because, temporarily, it is), and our customers started to see 5xx errors for roughly 6% of the logs our agent was trying to ingest.

Below we can see the pause time and pause frequency of the garbage collection:

GC pause and frequency charts

In some instances, the pause time breached 4 seconds (as shown on the left), with up to 400 pauses per minute (as shown on the right) across our instances.

After some more research, it appeared we were another victim of a memory leak in the AWS JavaScript SDK. We tried increasing resource allocations to extreme levels, like autoscaling after 1000 req/min, but nothing worked.

Possible solutions

With our backup plan no longer an option, we moved on to new solutions. First, we looked at those with the easiest transition path.

Elixir

As mentioned earlier, we are checking the customer access rights using an Elixir service. This service is private and only accessible from within our Virtual Private Cloud (VPC). We have never experienced any scalability issues with this service and most of the logic was already there. We could simply send the logs to Kinesis from within this service and skip over the Node.js service layer. We decided it was worth a try.

We developed the missing parts and tested it. It was better, but still not great. Our benchmarks showed that there were still high levels of garbage collection, and we were still returning 5xx errors to our users when consuming the logs. At this point, the heavy load triggered a (now resolved) issue with one of our Elixir dependencies.

Go

We considered Golang as well. It would have been a good candidate, but in the end, it is another garbage-collected language. While likely more efficient than our previous implementation, as we scale there is a high chance we'd run into similar problems. With these limitations in mind, we needed a better option.

Re-architecting with Rust at the core

In both our original implementation and our backup, the core issue remained the same: garbage collection. The solution was to move to a language with better memory management and no garbage collection. Enter Rust.

Rust isn't a garbage-collected language. Instead, it relies on a concept called ownership.

Ownership is Rust’s most unique feature, and it enables Rust to make memory safety guarantees without needing a garbage collector.

The Rust Book

Ownership is the concept that often makes Rust difficult to learn and write, but it is also what makes Rust so well suited for situations like ours. Each value in Rust has a single owner variable and, as a result, a single point of allocation in memory. Once that variable goes out of scope, the memory is freed immediately.
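
To make that concrete, here is a rough, generic illustration (not code from our service) of how memory is reclaimed the moment its owner goes out of scope, with no collector involved:

fn process(data: &str) {
    // Borrowing with `&` grants temporary access without taking ownership.
    println!("processing {} bytes", data.len());
}

fn main() {
    {
        // `payload` owns its heap allocation.
        let payload = String::from("{\"request\": \"...\"}");
        process(&payload);
    } // `payload` goes out of scope here; its memory is freed immediately,
      // with no garbage collector and no pause.

    // Ownership can also move. After a move, the old binding is unusable,
    // and the compiler enforces this at build time.
    let log = String::from("log line");
    let owner = log; // ownership moves from `log` to `owner`
    // println!("{}", log); // error[E0382]: borrow of moved value: `log`
    println!("{}", owner);
}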

Since the code required to ingest the logs is quite small, we decided to give it a try. To test this, we addressed the very thing we had issues with: sending large amounts of data to Kinesis.

Our first benchmarks proved to be very successful.

From that point, we were pretty confident that Rust could be the answer and we decided to flesh out the prototype into a production-ready application.

Over the course of these experiments, rather than directly replacing the original Node.js service with Rust, we restructured much of the architecture surrounding log ingestion. The core of the new service is an Envoy proxy with the Rust application as a sidecar.

Now, when the Bearer Agent in a user's application sends log data to Bearer, it goes into the Envoy proxy. Envoy looks at the request and communicates with Redis to check things like rate limits, authorization details, and usage quotas. Next, the Rust application running alongside Envoy prepares the log data and passes it through Kinesis into an S3 bucket for storage. S3 then triggers our worker to fetch and process the data so Elasticsearch can index it. At this point, our users can access the data in our dashboard.

Diagram of new Rust service
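
As a rough sketch of the sidecar's shape (this is not Bearer's production code: send_to_firehose is a hypothetical stand-in for the real AWS SDK call, and the channel size, batch size, and flush interval are made-up numbers), the Rust application can buffer incoming logs and flush them to Firehose when a batch fills up or a timer fires:

use tokio::sync::mpsc;
use tokio::time::{interval, Duration};

/// A single API-call log forwarded by Envoy.
struct LogRecord {
    payload: Vec<u8>,
}

/// Hypothetical stand-in for the real Firehose delivery call
/// (in production this would be a PutRecordBatch request via the AWS SDK).
async fn send_to_firehose(batch: Vec<LogRecord>) {
    let bytes: usize = batch.iter().map(|r| r.payload.len()).sum();
    println!("delivering {} records ({} bytes)", batch.len(), bytes);
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<LogRecord>(10_000);

    // The HTTP handler receiving logs from Envoy would clone `tx` and push
    // each prepared log onto the channel; simulated here with a few records.
    tokio::spawn(async move {
        for i in 0..5u8 {
            tx.send(LogRecord { payload: vec![i] }).await.expect("receiver dropped");
        }
    });

    // Flush either when the batch is full or when the timer ticks,
    // so Firehose sees fewer, larger requests.
    let mut batch: Vec<LogRecord> = Vec::with_capacity(500);
    let mut tick = interval(Duration::from_millis(200));
    loop {
        tokio::select! {
            maybe = rx.recv() => match maybe {
                Some(record) => {
                    batch.push(record);
                    if batch.len() >= 500 {
                        send_to_firehose(std::mem::take(&mut batch)).await;
                    }
                }
                None => {
                    // All senders dropped: flush what's left and stop.
                    if !batch.is_empty() {
                        send_to_firehose(batch).await;
                    }
                    break;
                }
            },
            _ = tick.tick() => {
                if !batch.is_empty() {
                    send_to_firehose(std::mem::take(&mut batch)).await;
                }
            }
        }
    }
}

Because nothing here is garbage collected, each batch's memory is freed as soon as it is dropped, which is what keeps the memory and CPU profiles flat under load.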

What we found was that with fewer, smaller servers, we were able to process even more data without any of the earlier issues.

If we look at the latency numbers for the Node.js service, we can see peaks with an average response time nearing 1700ms.

Latency with original Node.js service

With the Rust service implementation, the latency dropped to below 90ms, even at its highest peak, keeping the average response time below 40ms.

Latency after re-architecture

The original Node.js application used about 1.5GB of memory at any given time, while the CPUs ran at around 150% load. The new Rust service used about 100MB of memory and only 2.5% of CPU load.

Conclusion

As with most startups, we move fast. Sometimes the best solution at the time isn't the best solution forever. This was the case with Node.js. It allowed us to move forward, but as we grew, we also outgrew it. As we started to handle more and more requests, we needed our infrastructure to evolve to meet the new requirements. While this process started with a fix that merely replaced Node.js with Rust, it led to a rethinking of our log ingestion service as a whole.

We still use a variety of languages throughout our stack, including Node.js, but will now consider Rust for new services where it makes sense.

Top comments (10)

Naing Aung Phyo

Thanks for the detailed explanation! It’s a great article. Thanks for sharing.
One thing I just noticed is that Elixir is missing from the last overview architecture diagram, but that's not a problem with this article.

Prabhu R

Wonderful analysis and a great article. Very interesting to read how things were optimized and the solution improved.

Vicky Sundesha

Great read!

Felix Terkhorn

Nice results!

Stephen Leyva (He/Him)

Are you able to give any insights with regard to ramping up devs on a tech stack as different as Rust?

Cédric Fabianski

We are already using TypeScript quite heavily so we were all pretty used to it. The code of this application is fairly simple and the aim is to keep it that way for now. Finally, we had already done some PoC internally so we were already familiar with Rust :)

Max

It's a big leap to go from TS to Rust :) Can you share the repo link, if that Rust code is open source?

gronkdaslayer

You could also have used C/C++. It makes compact binaries, is faster than just about anything besides assembly (the difference isn't big either), and you take care of memory allocations all by yourself. No GC, no crazy concept of ownership, etc.

Julio Daniel Reyes

With C/C++ it's really easy to make a mistake with memory allocations and shoot yourself in the foot.
With Rust, your code won't compile if you have those problems, and it will even tell you what you're doing wrong and give you hints on how to solve it.
Also, if you want to manage memory yourself, you can, but you will need to explicitly mark the code as unsafe, so if you make a mistake there, at least you will know where to look.
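
For instance (an illustrative snippet, not from the original comment), a raw-pointer dereference has to be opted into explicitly, which makes manual memory handling easy to locate later:

fn main() {
    let n: i32 = 42;
    let p = &n as *const i32;

    // Dereferencing a raw pointer is only allowed inside an `unsafe` block,
    // so any manual memory handling stands out in review and in searches.
    unsafe {
        println!("{}", *p);
    }
}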

gronkdaslayer

It's easy to make mistakes if you're not careful or don't know what you're doing.

Rust does help a lot, but you can still do things like this:

fn main() {
    let mut arr = [1, 2, 3];
    for a in 0..10 {
        arr[a] = arr[a] + 1; // index runs past the end of `arr`
    }
}

Which will obviously panic at runtime. Essentially, out-of-bounds accesses are still possible; Rust just catches them with a panic rather than corrupting memory.
C/C++ puts the onus on the developer, which is not always a bad thing. If you ever get to write device drivers, you'll realize that you must remain extremely disciplined when writing your code, and it's not just a matter of language.

I actually like Rust. I'm seriously considering rewriting some code we have (in Java, which I dislike a lot) in Rust. It's mature enough, and it has a fair number of crates that I can use and not worry about writing myself.

The problem with higher-level languages like Java, Node, Python, C#, and, to a certain extent, Rust, is that they make devs lazy and complacent. They don't check return codes from functions, they don't check for null pointers, and they let the GC take care of the memory for them. At least Rust doesn't have a GC, which is a very good thing.