TL;DR
We partially migrated to Rust because our NestJS compression middleware defaulted Brotli to its maximum compression level, which noticeably hurt our performance - something that took months to identify and fix.
A bit of context
As presented in our previous post, IVAO has a website displaying live traffic data to thousands of visitors at the same time: Webeye
The website operates fairly simply: every 15 seconds, it fetches the data from 3 different endpoints (sketched below):
- `/now/pilots`: 3 MB
- `/now/atcs`: 3 MB
- `/now/observers`: 1 MB
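As a rough sketch (the base URL and the map-update logic are placeholders, not our actual frontend code), the polling loop boils down to this:

```ts
// Minimal polling sketch; BASE_URL and the update step are assumptions.
const BASE_URL = 'https://api.example.com';
const LIVE_ENDPOINTS = ['/now/pilots', '/now/atcs', '/now/observers'] as const;

async function refreshLiveData(): Promise<void> {
  // Fetch all three payloads in parallel on every tick
  const [pilots, atcs, observers] = await Promise.all(
    LIVE_ENDPOINTS.map(async (path) => {
      const res = await fetch(`${BASE_URL}${path}`);
      return res.json();
    }),
  );
  console.log(pilots, atcs, observers); // stand-in for updating the map markers
}

// Webeye polls every 15 seconds
setInterval(refreshLiveData, 15_000);
```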
We also have specific endpoints, fetched when you click on a user:
- `/sessions/:id`: 0.5 MB
- `/sessions/:id/tracks`: 0-5 MB
- `/whazzup`: 5 MB
Each endpoint used to take around 2 seconds to reply, which was unacceptably slow! We understand NestJS (running on Fastify, in our case) isn't the go-to framework for low-latency APIs, but it's the easiest to maintain across the 20+ APIs we have, and it's easy to find volunteers with the skills to do so.
Let's try to find the culprit!
First, we checked the NestJS performance best practices and confirmed that we were already using Fastify instead of Express. We also already had compression enabled to reduce the response payload size.
Also, at that time (which is no longer the case as I write this article), authentication & authorization were handled by Kong, our Kubernetes proxy, so the API only had to serve the data and didn't waste compute time on that.
Caching
Most of the time, when an API is slow, the first attempt to speed it up is to cache it.
We tried to implement caching at the Cloudflare level, but it didn't work: all our endpoints are authenticated with an API key or Bearer token, and Cloudflare doesn't cache those resources (which makes sense).
We also tried Redis caching, which seemed to work well, but the first person to make the request would still have to wait those long seconds. While it's technically possible to pre-populate the cache with a background task so the API only ever serves from it, we decided it would be an ugly, overcomplicated trick that would create new issues.
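For reference, the rejected approach would have looked roughly like this (a sketch only; the Redis client, key name, and intervals are assumptions, not our actual code):

```ts
import Redis from 'ioredis';

const redis = new Redis();

// Hypothetical expensive step: query the database and serialize the result
async function buildPilotsPayload(): Promise<string> {
  return JSON.stringify([]); // stand-in for the real Sequelize query
}

// Background task: refresh the cache on a timer so no request ever waits
async function warmCache(): Promise<void> {
  await redis.set('now:pilots', await buildPilotsPayload(), 'EX', 30);
}
setInterval(warmCache, 15_000);

// The request handler would then only ever read from Redis
async function getPilots(): Promise<string | null> {
  return redis.get('now:pilots');
}
```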
Digging into the code
While debugging, we leveraged JavaScript's `console.time` feature to pinpoint the function taking so much time. Our conclusion: the request was received and went through all the middlewares quickly, and the database query was already fast, but the response accounted for most of the time.
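Concretely, the instrumentation looked roughly like this (a simplified sketch with a placeholder query, not our real controller):

```ts
import { Controller, Get } from '@nestjs/common';

@Controller('now')
export class LiveDataController {
  @Get('pilots')
  async getPilots() {
    console.time('db-query');
    const pilots = await this.fetchPilots(); // stand-in for the Sequelize query
    console.timeEnd('db-query');             // -> fast

    console.time('serialize');
    JSON.stringify(pilots);                  // sanity-check serialization cost
    console.timeEnd('serialize');            // -> also fast

    // ...yet the client still waited ~2s, so the time was spent
    // after the handler returned, while the response was being sent
    return pilots;
  }

  private async fetchPilots(): Promise<unknown[]> {
    return []; // placeholder; the real code queries the DB via Sequelize
  }
}
```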
From there, our first guess was that the problem lay in serializing our Sequelize response into JSON.
We tried to implement Nestia, a set of NestJS helper libraries that advertises up to 200x faster serialization. After some very long nights fighting with it without any improvement, we came to the conclusion that we were either doing something completely wrong, or the issue wasn't in the serialization process...
Out of ideas
At some point, after trying to optimize our NestJS API as much as we could, we simply ran out of ideas.
As some developers suggested: "Let's try the most state-of-the-art solution for low-latency APIs!"
Migration to Rust (Actix)
Although I wasn't a big fan of introducing a new language into our tech stack (and still am not), we decided to give Actix, a Rust framework for APIs, a go; just to see what we could achieve and whether the issue was, indeed, coming from NodeJS.
Although we struggled a bit (none of us had experience with the framework), we finally replicated the exact same response bodies as in NestJS. The results were better than we expected: endpoints averaged 150ms! (From 2s to 150ms: roughly 13x faster.)
Bonus: the Rust API consumed only 50 MB of RAM, while NestJS consumed over 500 MB during peak time! (90% lower memory usage)
There wasn't much to debate: we deployed that Rust API to replace the NestJS one, but only for the endpoints frequently accessed by Webeye.
The user experience greatly improved, even though our codebase was now split across 2 completely different languages. That's a debate for another day ;)
Being stubborn about keeping the NestJS API
Although the Rust API was working very well, one question stayed on my mind: "How can NestJS be so slow, yet still so widely used?", so I kept digging!
During a long night of trying to find a solution, I started noticing something strange:
- Requests made from the browser: 2 seconds
- Requests made from Postman: 2 seconds
- Requests made from curl: 200ms
How is it possible that I wasn't getting the same results from curl?
Let's try to find the difference between Postman and curl!
I spent a few hours trying to figure out why Postman was so slow at processing requests, only to find that it wasn't Postman's fault...
Postman, by default, sends headers with the request, including `Accept-Encoding: br, deflate, gzip, *`.
This means that the API will try, in that order and as long as it supports them, to compress the response payload with Brotli, Deflate, Gzip, etc.
After disabling that default header, Postman also started making 200ms requests!
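You can reproduce the comparison without Postman at all; here's a rough timing sketch (the URL is a placeholder, and requesting 'identity' mimics curl's default behavior of not asking for compression):

```ts
// Node 18+ sketch, run as an ES module; the URL is a placeholder.
async function timeRequest(label: string, acceptEncoding: string): Promise<void> {
  const start = performance.now();
  const res = await fetch('https://api.example.com/now/pilots', {
    headers: { 'Accept-Encoding': acceptEncoding },
  });
  await res.arrayBuffer(); // make sure the full body has been downloaded
  console.log(`${label}: ${Math.round(performance.now() - start)} ms`);
}

// No compression requested, like a bare curl call
await timeRequest('identity', 'identity');
// Postman's default: Brotli first, which the server compresses at max quality
await timeRequest('br, deflate, gzip, *', 'br, deflate, gzip, *');
```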
We finally found the culprit!
After a bit of digging on the internet, I found the following GitHub issue on the NestJS repo: perf: brotli compression
Basically, it said that Brotli was configured by default at its maximum setting, which took a long time to compress the whole payload.
Although the GitHub issue recommends setting a lower quality value to speed up the process, we decided to completely disable Brotli in our APIs, as any value we tried to pass would still take longer than the other encodings.
Previous NestJS bootstrap code:

```ts
await app.register<FastifyCompressOptions>(compression);
```

New NestJS bootstrap code:

```ts
await app.register<FastifyCompressOptions>(compression, {
  // Disable `br` encoding
  encodings: ['deflate', 'gzip', 'identity'],
});
```
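For completeness, if you'd rather keep Brotli, the GitHub issue's suggestion maps to lowering the quality through @fastify/compress's brotliOptions (a sketch; quality 4 is just an example value - in our tests, any Brotli setting still lost to the other encodings):

```ts
import { constants } from 'node:zlib';

await app.register<FastifyCompressOptions>(compression, {
  // Keep Brotli but lower its quality from the default (maximum) to a cheaper level
  brotliOptions: {
    params: {
      [constants.BROTLI_PARAM_QUALITY]: 4,
    },
  },
});
```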
One might ask: "Aren't you trying to save bandwidth to reduce egress fees?", to which we happily reply that we are hosted at OVH in France, where there are zero egress fees!
With this fix in place, our NestJS APIs are now 10x faster (2s -> 200ms) on big payloads, which is way better than before. With that level of performance, switching to Rust wouldn't have been worth it (except for the RAM usage - but that's not today's problem).
To conclude
While trying to reduce the response time of our APIs, we discovered that the compression middleware we had enabled in NestJS defaults Brotli to its maximum setting, and that was the reason our endpoints were taking so long!