I've been pondering the possibility of improving the performance of the server used for Angular Universal applications, specifically in terms of requests per second. Without delving too deep into Angular's internals and its change detection processes, we typically encounter scenarios with a mix of pre-rendered pages and pages that always undergo the Angular rendering process on the server. A more efficient origin server would naturally be capable of handling a higher volume of requests. Let's explore various strategies to improve the speed of our origin servers. It's important to note that we won't be considering infrastructure-level optimizations or anything extending beyond the origin server, such as Cloudflare. Our focus will be on code-level improvements.
⚠️ The link to the GitHub repository will be provided at the end.
For this article, I have a simple Angular application that consists solely of an app.component.ts. The template is as follows:
<form>
<mat-form-field>
<mat-label>First name</mat-label>
<input matInput />
</mat-form-field>
<mat-form-field>
<mat-label>Last name</mat-label>
<input matInput />
</mat-form-field>
<button mat-raised-button>Submit</button>
</form>
I intentionally used Angular Material so that its CSS would be included, thereby increasing the final HTML size.
After building the app, I started the server with node and used the autocannon tool to benchmark it:
$ autocannon -c 100 -d 3 http://localhost:4200
Running 3s test @ http://localhost:4200
100 connections
┌─────────┬───────┬───────┬───────┬────────┬─────────┬──────────┬────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼───────┼───────┼───────┼────────┼─────────┼──────────┼────────┤
│ Latency │ 29 ms │ 38 ms │ 62 ms │ 126 ms │ 40.8 ms │ 13.48 ms │ 153 ms │
└─────────┴───────┴───────┴───────┴────────┴─────────┴──────────┴────────┘
┌───────────┬────────┬────────┬────────┬────────┬─────────┬─────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼─────────┼─────────┼────────┤
│ Req/Sec │ 1879 │ 1879 │ 2433 │ 2901 │ 2404.34 │ 417.73 │ 1879 │
├───────────┼────────┼────────┼────────┼────────┼─────────┼─────────┼────────┤
│ Bytes/Sec │ 138 MB │ 138 MB │ 179 MB │ 213 MB │ 176 MB │ 30.6 MB │ 138 MB │
└───────────┴────────┴────────┴────────┴────────┴─────────┴─────────┴────────┘
Req/Bytes counts sampled once per second.
# of samples: 3
7k requests in 3.02s, 529 MB read
Please note that these results may vary between different operating systems and computers.
Performance tends to degrade as the complexity of any app increases. Often, we overlook performance until later in the project life cycle (me too).
In the next steps, we will inspect what's happening behind the scenes and attempt to increase the number of requests within those 3 seconds.
Inspecting the Flame Graph
We can inspect the flame graph by using the 0x tool. All we need to do is run the server, replacing the node command with 0x.
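In my case, the command was roughly the following (the output path matches this project's build):
$ npx 0x dist/domino-perf/server/server.mjs
Then, we can view the generated flame graph: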
We can observe the etag function, which has the following label:
*etag dist/domino-perf/server/server.mjs:26153:18
Top of Stack: 15.1% (1355 of 8964 samples)
On Stack: 15.1% (1355 of 8964 samples)
The 'Top of Stack' metric, at 15.1%, indicates that this function was at the top of the call stack for 15.1% of the time during the profiler's sample recording.
The etag function, part of the express package, is executed whenever res.send is called, as it contains the following in its implementation:
res.send = function send2(body) {
// ...
var etagFn = app2.get("etag fn");
var generateETag = !this.get("ETag") && typeof etagFn === "function";
};
this.get indicates that it's retrieving something from the app settings. Therefore, we can disable the etag setting by setting it to false before starting the server:
const commonEngine = new CommonEngine();
server.set('etag', false);
server.set('view engine', 'html');
server.set('views', browserDistFolder);
Simply by disabling the etag setting, I reran the server and the autocannon tool:
$ autocannon -c 100 -d 3 http://localhost:4200
Running 3s test @ http://localhost:4200
100 connections
┌─────────┬───────┬───────┬───────┬───────┬──────────┬──────────┬────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼───────┼───────┼───────┼───────┼──────────┼──────────┼────────┤
│ Latency │ 19 ms │ 24 ms │ 45 ms │ 57 ms │ 27.02 ms │ 12.09 ms │ 151 ms │
└─────────┴───────┴───────┴───────┴───────┴──────────┴──────────┴────────┘
┌───────────┬────────┬────────┬────────┬────────┬────────┬─────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Req/Sec │ 2583 │ 2583 │ 4017 │ 4291 │ 3630 │ 748.69 │ 2583 │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Bytes/Sec │ 189 MB │ 189 MB │ 295 MB │ 315 MB │ 266 MB │ 54.9 MB │ 189 MB │
└───────────┴────────┴────────┴────────┴────────┴────────┴─────────┴────────┘
Req/Bytes counts sampled once per second.
# of samples: 3
11k requests in 3.02s, 798 MB read
Just by skipping ETag generation, the number of requests served within those 3 seconds increased from 7k to 11k.
The issue is that disabling the ETag setting is not a proper solution in our case, since the ETag is used to identify the resource version. Express generates weak ETags (prefixed with W/) for resources by default (refer to the defaultConfiguration function that runs when the application is created). If the ETag hasn't changed for the server-side rendered content, the Express server can reply with just a 304 status code instead of sending the content over the network again.
Now, let's examine the etag implementation used by Express:
// node_modules/etag/index.js
function entitytag (entity) {
if (entity.length === 0) {
// fast-path empty
return '"0-2jmj7l5rSw0yVb/vlWAYkK/YBwk"'
}
// compute hash of entity
var hash = crypto
.createHash('sha1')
.update(entity, 'utf8')
.digest('base64')
.substring(0, 27)
// compute length of entity
var len = typeof entity === 'string'
? Buffer.byteLength(entity, 'utf8')
: entity.length
return '"' + len.toString(16) + '-' + hash + '"'
}
So, it essentially computes a SHA-1 hash of the entity, encodes it as base64, and returns only the first 27 characters of that hash. SHA-1 might seem like overkill for generating ETags, and since security is not a concern here, we could have opted for the MD5 algorithm instead. However, after checking, MD5 doesn't appear to be any faster. I then explored the CRC-32 algorithm, which is commonly used for calculating checksums but can also serve to generate ETags for resources. The output of CRC-32 is a 32-bit unsigned integer. The zlib library contains a CRC-32 implementation, allowing us to create a C++ addon that generates the 32-bit checksum:
#include <js_native_api.h>
#include <node_api.h>
#include <zlib.h>
#include <cstring>
#include <string>
const char* fast_path = "W/\"0\"";
napi_value etag(napi_env env, napi_callback_info info) {
size_t argc = 1;
napi_value argv[1];
napi_get_cb_info(env, info, &argc, argv, NULL, NULL);
size_t buffer_length;
void* buffer;
napi_get_buffer_info(env, argv[0], &buffer, &buffer_length);
napi_value result;
if (buffer_length == 0) {
// Fast-path for empty buffer.
napi_create_string_utf8(env, fast_path, 5, &result);
return result;
}
char* buffer_content = new char[buffer_length + 1]();
// Faster than `memcpy` from `string.h`.
std::memcpy(buffer_content, buffer, buffer_length);
buffer_content[buffer_length] = '\n';
// Calculate CRC-32 over the entire content.
uint32_t crc = crc32(0L, Z_NULL, 0);
crc = crc32(crc, reinterpret_cast<Bytef*>(buffer_content), buffer_length);
delete[] buffer_content;
// Create the final string.
// `std::to_string` is faster than `std::format` and `std::stringstream`.
std::string final =
"W/\"" + std::to_string(buffer_length) + "-" + std::to_string(crc) + "\"";
napi_create_string_utf8(env, final.c_str(), final.size(), &result);
return result;
}
napi_value init(napi_env env, napi_value exports) {
napi_value etag_fn;
napi_create_function(env, NULL, NAPI_AUTO_LENGTH, etag, NULL, &etag_fn);
napi_set_named_property(env, exports, "etag", etag_fn);
return exports;
}
NAPI_MODULE(NODE_GYP_MODULE_NAME, init)
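To compile this addon with node-gyp, a binding.gyp along the lines of the one shown later for the content-type addon should work. This is a sketch with an assumed source file name, explicitly linking against zlib:
{
  'targets': [
    {
      'target_name': 'etag',
      'defines': ['NDEBUG', 'NAPI_DISABLE_CPP_EXCEPTIONS'],
      'sources': ['native/etag.cc'],
      'libraries': ['-lz'],
      'cflags_cc': ['-std=c++17', '-fexceptions', '-O3', '-Wall', '-Wextra']
    }
  ]
}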
This is a mix of C and C++ code.
I had previously examined OpenSSL's implementations of SHA1 and MD5:
#include <openssl/sha.h>
unsigned char hash[SHA_DIGEST_LENGTH];
SHA1(reinterpret_cast<const unsigned char*>(etag_string),
buffer_length,
hash);
However, they're slower compared to CRC-32.
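If you want to sanity-check the relative cost of these algorithms from Node itself (without a native addon), here's a rough micro-benchmark sketch. Note that zlib.crc32 is only available in recent Node.js versions, which is an assumption here, hence the guard:
// hash-bench.mjs: compare the cost of the hashing options for ETag generation.
import crypto from 'node:crypto';
import zlib from 'node:zlib';

const body = Buffer.alloc(64 * 1024, 'a'); // roughly the size of a rendered page

for (const algo of ['sha1', 'md5']) {
  const t0 = performance.now();
  for (let i = 0; i < 1e4; i++) {
    crypto.createHash(algo).update(body).digest('base64');
  }
  console.log(`${algo}: ${(performance.now() - t0).toFixed(2)}ms`);
}

if (typeof zlib.crc32 === 'function') {
  const t0 = performance.now();
  for (let i = 0; i < 1e4; i++) zlib.crc32(body);
  console.log(`crc32: ${(performance.now() - t0).toFixed(2)}ms`);
}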
To set the custom ETag function, it's necessary to override the etag setting:
const { etag } = require(`${process.cwd()}/build/Release/etag.node`);
const etagFn = server.get('etag fn');
server.set('etag', (body, encoding) => {
// Faster than `Buffer.isBuffer`.
if (body?.buffer) {
return etag(body);
} else {
return etagFn(body, encoding);
}
});
Now, let's run the server with 0x and execute the autocannon tool again:
$ autocannon -c 100 -d 3 http://localhost:4200
Running 3s test @ http://localhost:4200
100 connections
┌─────────┬───────┬───────┬───────┬────────┬──────────┬──────────┬────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼───────┼───────┼───────┼────────┼──────────┼──────────┼────────┤
│ Latency │ 23 ms │ 27 ms │ 47 ms │ 120 ms │ 30.53 ms │ 12.62 ms │ 159 ms │
└─────────┴───────┴───────┴───────┴────────┴──────────┴──────────┴────────┘
┌───────────┬────────┬────────┬────────┬────────┬─────────┬────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼─────────┼────────┼────────┤
│ Req/Sec │ 2299 │ 2299 │ 3501 │ 3859 │ 3219.67 │ 667.22 │ 2298 │
├───────────┼────────┼────────┼────────┼────────┼─────────┼────────┼────────┤
│ Bytes/Sec │ 169 MB │ 169 MB │ 257 MB │ 283 MB │ 236 MB │ 49 MB │ 169 MB │
└───────────┴────────┴────────┴────────┴────────┴─────────┴────────┴────────┘
Req/Bytes counts sampled once per second.
# of samples: 3
10k requests in 3.02s, 708 MB read
Let's look at the flamegraph:
The etag function is no longer at the top of the stack. Now, I will generate a flame graph using the perf tool that comes with Linux. This should provide additional information, especially about internal and system calls.
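The rough workflow looks like this, assuming Brendan Gregg's FlameGraph scripts are checked out locally (the exact paths and durations are assumptions):
$ node --perf-basic-prof dist/domino-perf/server/server.mjs &
$ perf record -F 99 -g -p <server PID> -- sleep 10   # while autocannon hammers the server
$ perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > perf-flame.svg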
We can observe node:internal/fs/promises on the stack. Line 37052 contains the retrieveSSGPage method, which is part of the CommonEngine. It has the following:
const pagePath = join(publicPath, pathname, "index.html");
if (this.pageIsSSG.get(pagePath)) {
return fs.promises.readFile(pagePath, "utf-8");
}
The pagePath is an absolute path to the browser/index.html file. The content of the page is not cached, and the file is read for every request. Let's update the CommonEngine constructor by adding a cache property and then place the content into the cache inside the retrieveSSGPage method. I will make this update directly within the dist/{app}/server/server.mjs file.
var CommonEngine = class {
constructor(options) {
this.options = options;
this.inlineCriticalCssProcessor = new InlineCriticalCssProcessor({
minify: false,
});
this.cache = new Map(); // 👈
}
retrieveSSGPage() {
// Note: in the built output this body is wrapped in `__async(..., function* () { ... })`,
// which is why `yield` is used below instead of `await`.
// ...
if (this.pageIsSSG.get(pagePath)) {
if (this.cache.has(pagePath)) {
return this.cache.get(pagePath); // 👈
} else {
const content = yield fs.promises.readFile(pagePath, 'utf-8');
this.cache.set(pagePath, content);
return content;
}
}
}
};
Now, let's run the server and benchmark again:
$ autocannon -c 100 -d 3 http://localhost:4200
Running 3s test @ http://localhost:4200
100 connections
┌─────────┬───────┬───────┬───────┬───────┬──────────┬──────────┬────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼───────┼───────┼───────┼───────┼──────────┼──────────┼────────┤
│ Latency │ 16 ms │ 21 ms │ 41 ms │ 61 ms │ 23.96 ms │ 11.37 ms │ 136 ms │
└─────────┴───────┴───────┴───────┴───────┴──────────┴──────────┴────────┘
┌───────────┬────────┬────────┬────────┬────────┬─────────┬─────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼─────────┼─────────┼────────┤
│ Req/Sec │ 2771 │ 2771 │ 4491 │ 4975 │ 4078.34 │ 945.31 │ 2770 │
├───────────┼────────┼────────┼────────┼────────┼─────────┼─────────┼────────┤
│ Bytes/Sec │ 203 MB │ 203 MB │ 330 MB │ 365 MB │ 299 MB │ 69.4 MB │ 203 MB │
└───────────┴────────┴────────┴────────┴────────┴─────────┴─────────┴────────┘
Req/Bytes counts sampled once per second.
# of samples: 3
12k requests in 3.02s, 897 MB read
Now, on the perf flame graph, we can observe that regular expressions also consume a considerable amount of time:
We can patch RegExp.prototype functions to track all of the invocations per request:
server.listen(port, () => {
console.log(`Node Express server listening on http://localhost:${port}`);
['test', 'exec'].forEach(fn => {
const proto = RegExp.prototype as any;
const originalFn = proto[fn];
proto[fn] = function (...args: any[]) {
console.error('fn is called: ', fn);
console.trace();
return originalFn.apply(this, args);
};
});
});
If we build, run the server, and hit the index page once, we can observe all of the prototype logs. The exec function is called 11 times, and test is called 7 times per request. Since regular expressions deal with strings, we can cache results for strings that have already matched the specified pattern. One of the packages that actively uses regular expressions is content-type, which has the following patterns:
// node_modules/content-type/index.js
var PARAM_REGEXP =
/; *([!#$%&'*+.^_`|~0-9A-Za-z-]+) *= *("(?:[\u000b\u0020\u0021\u0023-\u005b\u005d-\u007e\u0080-\u00ff]|\\[\u000b\u0020-\u00ff])*"|[!#$%&'*+.^_`|~0-9A-Za-z-]+) */g;
var TEXT_REGEXP = /^[\u000b\u0020-\u007e\u0080-\u00ff]+$/;
var TOKEN_REGEXP = /^[!#$%&'*+.^_`|~0-9A-Za-z-]+$/;
var QESC_REGEXP = /\\([\u000b\u0020-\u00ff])/g;
var QUOTE_REGEXP = /([\\"])/g;
var TYPE_REGEXP = /^[!#$%&'*+.^_`|~0-9A-Za-z-]+\/[!#$%&'*+.^_`|~0-9A-Za-z-]+$/;
While these patterns may seem complicated and extensive, the V8 engine boasts an exceptionally fast regular expression engine compared to the C++ standard regular expressions library (std::regex), RE-flex, and re2.
The content-type package exports the parse and format functions, which call test and exec multiple times, including inside a while loop:
function format(obj) {
// ...
for (var i = 0; i < params.length; i++) {
param = params[i];
if (!TOKEN_REGEXP.test(param)) {
throw new TypeError('invalid parameter name');
}
string += '; ' + param + '=' + qstring(parameters[param]);
}
// ...
}
function parse(string) {
// ...
while ((match = PARAM_REGEXP.exec(header))) {
// ...
}
// ...
}
I also checked the regex benchmark at https://github.com/mariomka/regex-benchmark?tab=readme-ov-file#performance, which lists the Nim language at the top, stating it's significantly faster than other implementations. Before making any JavaScript changes, I decided to assess whether it would bring any benefits. The necessary steps include:
- writing Nim code, which needs to be compiled into a static library
- developing a C++ addon to act as a caller proxy to the Nim function, as we'll link against the static library
- writing the JavaScript code to call the addon
I will place the Nim code into the nim folder:
# nim/matches.nim
import regex
const
TYPE_REGEXP = re2"^[!#$%&'*+.^_`|~0-9A-Za-z-]+\/[!#$%&'*+.^_`|~0-9A-Za-z-]+$"
proc matchesTypeRegexp(source: cstring): bool {.cdecl, exportc.} =
return match($source, TYPE_REGEXP)
We now need to compile it into a static library by running the following command:
$ nim c --app:staticlib --nimblePath:. --passC:-fPIC -d:release matches.nim
The above command will generate the libmatches.a file. Now, we need to write the C++ code that will call the matchesTypeRegexp function with the string provided by JavaScript:
#include <js_native_api.h>
#include <node_api.h>
extern "C" {
bool matchesTypeRegexp(char*);
}
napi_value matches_type_regexp(napi_env env, napi_callback_info info) {
size_t argc = 1;
napi_value argv[1];
napi_get_cb_info(env, info, &argc, argv, NULL, NULL);
size_t source_length;
napi_get_value_string_utf8(env, argv[0], NULL, 0, &source_length);
char* source = new char[source_length + 1]();
napi_get_value_string_utf8(env, argv[0], source, source_length + 1, NULL);
napi_value result;
napi_get_boolean(env, matchesTypeRegexp(source), &result);
delete[] source;
return result;
}
napi_value init(napi_env env, napi_value exports) {
napi_value matches_type_regexp_fn;
napi_create_function(env, NULL, NAPI_AUTO_LENGTH, matches_type_regexp, NULL,
&matches_type_regexp_fn);
napi_set_named_property(env, exports, "matchesTypeRegexp",
matches_type_regexp_fn);
return exports;
}
NAPI_MODULE(NODE_GYP_MODULE_NAME, init)
The binding.gyp file should also be updated to include the .a file:
{
'targets': [
{
'target_name': 'content_type',
'defines': ['NDEBUG', 'NAPI_DISABLE_CPP_EXCEPTIONS'],
'sources': ['native/content-type.cc'],
'libraries': ['<(module_root_dir)/nim/libmatches.a'],
'cflags_cc': ['-std=c++17', '-fexceptions', '-O3', '-Wall', '-Wextra']
}
]
}
Now, after running node-gyp build and obtaining the compiled addon within the build folder, we can test whether the addon is faster or not:
// benchmark.js
const { matchesTypeRegexp } = require('./build/Release/content_type.node');
const { performance } = require('perf_hooks');
let t0, t1;
const TYPE_REGEXP =
/^[!#$%&'*+.^_`|~0-9A-Za-z-]+\/[!#$%&'*+.^_`|~0-9A-Za-z-]+$/;
t0 = performance.now();
for (let i = 0; i < 1e6; i++) {
let v = 'text/html'.match(TYPE_REGEXP);
}
t1 = performance.now();
console.log(`JS: ${t1 - t0}ms`);
t0 = performance.now();
for (let i = 0; i < 1e6; i++) {
let v = matchesTypeRegexp('text/html');
}
t1 = performance.now();
console.log(`Nim: ${t1 - t0}ms`);
If we execute the above file, we'll get the following results:
$ node benchmark.js
JS: 25.362835999578238ms
Nim: 886.7281489986926ms
The benchmark indicates that the JS implementation is significantly faster than the Nim one. However, it's worth noting that much of the overhead comes not from the regex itself but from crossing the JS-to-native boundary: every time we call the C++ addon function, V8 has to go through its built-in machinery for invoking external native functions. If we debug the Node process running the file with a single call to matchesTypeRegexp('text/html'), we'll observe the following:
#0 0x00007ffff7e34442 in matches_type_regexp(napi_env__*, napi_callback_info__*) ()
#1 0x0000000000b10d7d in v8impl::(anonymous namespace)::FunctionCallbackWrapper::Invoke(v8::FunctionCallbackInfo<v8::Value> const&) ()
#2 0x0000000000db0230 in v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) ()
#3 0x0000000000db176f in v8::internal::Builtin_HandleApiCall(int, unsigned long*, v8::internal::Isolate*) ()
#4 0x00000000016ef579 in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_BuiltinExit ()
Improving those regular expressions themselves would require being a regex expert; it may be easier, and sufficient, to cache results and avoid executing the regular expressions against the same arguments. We can create a local cache and refrain from running test and exec against an argument that has already been formatted or parsed:
const formatCache = new Map();
function format(obj) {
if (!obj || typeof obj !== 'object') {
throw new TypeError('argument obj is required');
}
const cacheKey = JSON.stringify(obj);
if (formatCache.has(cacheKey)) {
return formatCache.get(cacheKey);
}
// ...
formatCache.set(cacheKey, string);
return string;
}
const parseCache = new Map();
function parse(string) {
// ...
var header = typeof string === 'object' ? getcontenttype(string) : string;
if (typeof header !== 'string') {
throw new TypeError('argument string is required to be a string');
}
if (parseCache.has(header)) {
return JSON.parse(parseCache.get(header));
}
// ...
parseCache.set(header, JSON.stringify(obj));
return obj;
}
After these changes I ran the build and the benchmark again and got the following results:
$ autocannon -c 100 -d 3 http://localhost:4200
Running 3s test @ http://localhost:4200
100 connections
┌─────────┬───────┬───────┬───────┬───────┬──────────┬──────────┬────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼───────┼───────┼───────┼───────┼──────────┼──────────┼────────┤
│ Latency │ 12 ms │ 19 ms │ 42 ms │ 50 ms │ 21.61 ms │ 11.93 ms │ 144 ms │
└─────────┴───────┴───────┴───────┴───────┴──────────┴──────────┴────────┘
┌───────────┬────────┬────────┬────────┬────────┬────────┬─────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Req/Sec │ 2885 │ 2885 │ 4951 │ 5723 │ 4519 │ 1197.63 │ 2884 │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Bytes/Sec │ 212 MB │ 212 MB │ 363 MB │ 420 MB │ 331 MB │ 87.8 MB │ 212 MB │
└───────────┴────────┴────────┴────────┴────────┴────────┴─────────┴────────┘
Req/Bytes counts sampled once per second.
# of samples: 3
14k requests in 3.02s, 994 MB read
The framework itself may also slow things down: using the plain Node.js createServer in conjunction with engine.render allows handling more requests within those 3 seconds:
const server = createServer(async (req, res) => {
const html = await commonEngine.render({
bootstrap,
documentFilePath: indexHtml,
url: `http://${req.headers.host}${req.url}`,
publicPath: browserDistFolder,
providers: [{ provide: APP_BASE_HREF, useValue: '' }],
});
res.end(html);
});
Results after running autocannon look as follows:
$ autocannon -c 100 -d 3 http://localhost:4200
Running 3s test @ http://localhost:4200
100 connections
┌─────────┬──────┬───────┬───────┬───────┬──────────┬─────────┬────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼──────┼───────┼───────┼───────┼──────────┼─────────┼────────┤
│ Latency │ 9 ms │ 11 ms │ 24 ms │ 41 ms │ 12.93 ms │ 8.83 ms │ 137 ms │
└─────────┴──────┴───────┴───────┴───────┴──────────┴─────────┴────────┘
┌───────────┬────────┬────────┬────────┬────────┬─────────┬─────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼─────────┼─────────┼────────┤
│ Req/Sec │ 5011 │ 5011 │ 8503 │ 8815 │ 7440.67 │ 1723.46 │ 5008 │
├───────────┼────────┼────────┼────────┼────────┼─────────┼─────────┼────────┤
│ Bytes/Sec │ 367 MB │ 367 MB │ 623 MB │ 646 MB │ 545 MB │ 126 MB │ 367 MB │
└───────────┴────────┴────────┴────────┴────────┴─────────┴─────────┴────────┘
Req/Bytes counts sampled once per second.
# of samples: 3
22k requests in 3.02s, 1.64 GB read
Since Angular SSR is framework-agnostic and no longer directly relies on the Express engine (as it did with @nguniversal/express-engine), we're free to use any framework. We only have to run engine.render when necessary to return HTML for a specific route. Initially, I installed and benchmarked Fastify, but it didn't yield significant benefits; it handled a comparable number of requests to Express with the ETag setting disabled (since Fastify doesn't set the ETag header by default). Afterwards, I installed Koa along with middleware such as koa-static to serve static files. Koa outperformed both Fastify and Express in terms of speed. Here's the code I used for Koa; it resembles the Express code we had previously:
const commonEngine = new CommonEngine();
const app = new Koa();
const router = new Router();
router.get(/.*/, async ctx => {
const { protocol, headers, originalUrl } = ctx;
const html = await commonEngine.render({
bootstrap,
documentFilePath: indexHtml,
url: `${protocol}://${headers.host}${originalUrl}`,
publicPath: browserDistFolder,
providers: [{ provide: APP_BASE_HREF, useValue: '' }],
});
ctx.body = html;
});
app.use(serveStatic(browserDistFolder));
app.use(router.routes());
These are the results after running autocannon:
$ autocannon -c 100 -d 3 http://localhost:4200
Running 3s test @ http://localhost:4200
100 connections
┌─────────┬───────┬───────┬───────┬───────┬──────────┬──────────┬────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼───────┼───────┼───────┼───────┼──────────┼──────────┼────────┤
│ Latency │ 12 ms │ 15 ms │ 39 ms │ 52 ms │ 18.68 ms │ 12.23 ms │ 163 ms │
└─────────┴───────┴───────┴───────┴───────┴──────────┴──────────┴────────┘
┌───────────┬────────┬────────┬────────┬────────┬─────────┬─────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼─────────┼─────────┼────────┤
│ Req/Sec │ 3145 │ 3145 │ 5935 │ 6547 │ 5208.34 │ 1480.24 │ 3145 │
├───────────┼────────┼────────┼────────┼────────┼─────────┼─────────┼────────┤
│ Bytes/Sec │ 231 MB │ 231 MB │ 435 MB │ 480 MB │ 382 MB │ 108 MB │ 230 MB │
└───────────┴────────┴────────┴────────┴────────┴─────────┴─────────┴────────┘
Req/Bytes counts sampled once per second.
# of samples: 3
16k requests in 3.02s, 1.15 GB read
So, this is only slightly faster (9%) than what we achieved previously with the Express manipulations. However, we want to maintain the previous behavior of generating an ETag for the content so that we can send a 304 response when the content hasn't changed. We need to install koa-conditional-get and koa-etag, since they work in conjunction:
const conditional = require('koa-conditional-get');
const etag = require('koa-etag');
app.use(conditional());
app.use(etag());
app.use(serveStatic(browserDistFolder));
app.use(router.routes());
Now, let's build and run autocannon again:
$ autocannon -c 100 -d 3 http://localhost:4200
Running 3s test @ http://localhost:4200
100 connections
┌─────────┬───────┬───────┬───────┬───────┬──────────┬──────────┬────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼───────┼───────┼───────┼───────┼──────────┼──────────┼────────┤
│ Latency │ 13 ms │ 19 ms │ 48 ms │ 67 ms │ 21.81 ms │ 14.25 ms │ 164 ms │
└─────────┴───────┴───────┴───────┴───────┴──────────┴──────────┴────────┘
┌───────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Req/Sec │ 2601 │ 2601 │ 5103 │ 5699 │ 4467 │ 1341.71 │ 2600 │
├───────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Bytes/Sec │ 2.34 MB │ 2.34 MB │ 4.59 MB │ 5.12 MB │ 4.02 MB │ 1.21 MB │ 2.34 MB │
└───────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
Req/Bytes counts sampled once per second.
# of samples: 3
13k requests in 3.02s, 12 MB read
Okay, so the ETag calculation slowed things down by 18%. The koa-etag middleware also uses the etag package, but it does the following:
// node_modules/koa-etag/index.js
const calculate = require('etag')
module.exports = function etag (options) {
return async function etag (ctx, next) {
await next()
const entity = await getResponseEntity(ctx)
setEtag(ctx, entity, options)
}
}
async function getResponseEntity (ctx) {
// no body
const body = ctx.body
if (!body || ctx.response.get('etag')) return
// type
const status = ctx.status / 100 | 0
// 2xx
if (status !== 2) return
if (body instanceof Stream) {
if (!body.path) return
return await stat(body.path)
} else if ((typeof body === 'string') || Buffer.isBuffer(body)) {
return body
} else {
return JSON.stringify(body)
}
}
function setEtag (ctx, entity, options) {
if (!entity) return
ctx.response.etag = calculate(entity, options)
}
We may notice that it also checks whether the ctx.body is a stream. For pre-rendered files, it is actually a stream, because koa-static uses koa-send, which can handle index files automatically when visiting the root location. For example, if Angular pre-renders the /home route, it places the output content within browser/home/index.html. This file is then picked up by koa-send before the request ever reaches engine.render. koa-send sets ctx.body to fs.createReadStream('path-to-index.html') each time the matching URL (in our app, /) is hit. Subsequently, koa-etag checks that it's a stream (body instanceof Stream) and runs fs.stat on body.path to retrieve the stats object.
We can update the koa-send code by implementing a small caching mechanism instead of executing fs.createReadStream each time the index.html needs to be read:
// This could also be an LRU cache.
const cache = new Map()
async function send(ctx, path, opts = {}) {
// ...
if (!cache.has(path)) {
const buffer = await fs.promises.readFile(path)
cache.set(path, buffer)
}
ctx.body = cache.get(path)
}
Even though this would cache every file that is attempted to be read, we could extend the opts to specify a list of files that are never updated and should be cached forever, as sketched below.
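For example, a hypothetical cacheable option could restrict the cache to such files (the option name and shape are assumptions, not part of koa-send's API):
// This could also be an LRU cache.
const cache = new Map()

async function send(ctx, path, opts = {}) {
  // ...
  // `opts.cacheable` is a hypothetical allow-list of files that never change.
  const cacheable = Array.isArray(opts.cacheable) && opts.cacheable.includes(path)
  if (cacheable && cache.has(path)) {
    ctx.body = cache.get(path)
    return path
  }
  const buffer = await fs.promises.readFile(path)
  if (cacheable) cache.set(path, buffer)
  ctx.body = buffer
  // ...
}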
We can then update koa-etag to use our C++ addon in cases where the provided argument is a buffer:
const calculate = require('etag')
const { etag } = require(`${process.cwd()}/build/Release/etag.node`)
function setEtag (ctx, entity, options) {
if (!entity) return
// Faster than `Buffer.isBuffer`.
if (entity.buffer) {
ctx.response.etag = etag(entity)
} else {
ctx.response.etag = calculate(entity, options)
}
}
Let's build and run the benchmark again:
$ autocannon -c 100 -d 3 http://localhost:4200
Running 3s test @ http://localhost:4200
100 connections
┌─────────┬───────┬───────┬───────┬───────┬──────────┬──────────┬────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼───────┼───────┼───────┼───────┼──────────┼──────────┼────────┤
│ Latency │ 11 ms │ 12 ms │ 30 ms │ 41 ms │ 14.43 ms │ 10.17 ms │ 154 ms │
└─────────┴───────┴───────┴───────┴───────┴──────────┴──────────┴────────┘
┌───────────┬────────┬────────┬────────┬────────┬────────┬─────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Req/Sec │ 4191 │ 4191 │ 7743 │ 8255 │ 6728 │ 1806.68 │ 4189 │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Bytes/Sec │ 307 MB │ 307 MB │ 568 MB │ 606 MB │ 494 MB │ 133 MB │ 307 MB │
└───────────┴────────┴────────┴────────┴────────┴────────┴─────────┴────────┘
Req/Bytes counts sampled once per second.
# of samples: 3
20k requests in 3.02s, 1.48 GB read
Let's revisit the flamegraph and observe the frequent occurrence of the __async function at the top of the stack:
The __async function replaces async/await code because zone.js is unable to track native async/await. While zone.js tracks promises by overriding the promise constructor (Promise -> ZoneAwarePromise), this approach falls short when dealing with async/await. We're not going to delve deeply into zone.js complexities, but it's worth noting that downleveling is only necessary for the Angular code. Third-party packages such as koa-etag don't require downleveling; however, ESBuild applies it to every package. Downleveling transforms this:
module.exports = function etag (options) {
return async function etag (ctx, next) {
await next()
const entity = await getResponseEntity(ctx)
setEtag(ctx, entity, options)
}
}
To this:
module.exports = function etag(options) {
return function etag(ctx, next) {
return __async(this, null, function* () {
yield next();
const entity = yield getResponseEntity(ctx);
setEtag(ctx, entity, options);
});
};
};
For curiosity's sake, I moved the Koa packages into a separate file called koa.ts (at the same level as server.ts) and re-exported the packages we're using in our example:
import Koa from 'koa';
import Router from '@koa/router';
import serveStatic from 'koa-static';
const conditional = require('koa-conditional-get');
const etag = require('koa-etag');
export { Koa, Router, serveStatic, conditional, etag };
And then I used await import('./koa') within server.ts inside the app function (I had to mark it as async) so that ESBuild would create a separate chunk specifically for the Koa third-party packages. It's then easier to modify the output file directly where async/await has been replaced with __async.
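Roughly, the relevant part of server.ts then looks like this (a sketch; the rest of the setup stays as shown earlier):
// server.ts (sketch)
export async function app() {
  // The dynamic import makes ESBuild emit a separate chunk for the Koa packages.
  const { Koa, Router, serveStatic, conditional, etag } = await import('./koa');

  const server = new Koa();
  // ... the same middleware and router setup as before ...
  return server;
}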
After running the build, I went to the server/chunk-HASH.mjs file that contained the Koa packages and replaced all the downleveling transformations back from __async to async/await. After that, I reran the benchmark:
$ autocannon -c 100 -d 3 http://localhost:4200
Running 3s test @ http://localhost:4200
100 connections
┌─────────┬──────┬──────┬───────┬───────┬──────────┬────────┬────────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼──────┼──────┼───────┼───────┼──────────┼────────┼────────┤
│ Latency │ 8 ms │ 8 ms │ 23 ms │ 31 ms │ 10.06 ms │ 7.4 ms │ 125 ms │
└─────────┴──────┴──────┴───────┴───────┴──────────┴────────┴────────┘
┌───────────┬────────┬────────┬────────┬────────┬────────┬─────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Req/Sec │ 5367 │ 5367 │ 11215 │ 11607 │ 9394 │ 2852.72 │ 5365 │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Bytes/Sec │ 394 MB │ 394 MB │ 823 MB │ 851 MB │ 689 MB │ 209 MB │ 394 MB │
└───────────┴────────┴────────┴────────┴────────┴────────┴─────────┴────────┘
Req/Bytes counts sampled once per second.
# of samples: 3
28k requests in 3.02s, 2.07 GB read
Now, let's also try running this benchmark with Bun. Bun isn't friendly with N-API yet at the time this article is being written (January 2024), so we have to revert the koa-etag changes and use the etag package instead of our C++ addon. We also need to remove zone.js from the angular.json polyfills, because Bun isn't friendly with its patches: no matter what, it just doesn't return any response (and reports the response as empty) when zone.js is used. We should also add ɵprovideZonelessChangeDetection() to our app.config.ts; this function has been exported from @angular/core since 17.1.0-rc.0 (please note that it may already be exported without the theta prefix by the time you're reading this).
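Here's a sketch of what that looks like in app.config.ts (other providers from the default scaffold are omitted):
// app.config.ts (sketch)
import { ApplicationConfig, ɵprovideZonelessChangeDetection } from '@angular/core';

export const appConfig: ApplicationConfig = {
  providers: [ɵprovideZonelessChangeDetection()],
};
With those changes in place, here are the results of running the benchmark against the Bun server: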
$ autocannon -c 100 -d 3 http://localhost:4200
Running 3s test @ http://localhost:4200
100 connections
┌─────────┬───────┬───────┬───────┬───────┬──────────┬─────────┬───────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼───────┼───────┼───────┼───────┼──────────┼─────────┼───────┤
│ Latency │ 16 ms │ 20 ms │ 34 ms │ 41 ms │ 20.55 ms │ 6.76 ms │ 95 ms │
└─────────┴───────┴───────┴───────┴───────┴──────────┴─────────┴───────┘
┌───────────┬────────┬────────┬────────┬────────┬────────┬─────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Req/Sec │ 4379 │ 4379 │ 4811 │ 5051 │ 4746 │ 278.06 │ 4377 │
├───────────┼────────┼────────┼────────┼────────┼────────┼─────────┼────────┤
│ Bytes/Sec │ 321 MB │ 321 MB │ 353 MB │ 370 MB │ 348 MB │ 20.4 MB │ 321 MB │
└───────────┴────────┴────────┴────────┴────────┴────────┴─────────┴────────┘
Req/Bytes counts sampled once per second.
# of samples: 3
14k requests in 3.02s, 1.04 GB read
I had also heard about the Elysia framework (https://elysiajs.com), which is developed to run on Bun. Since we would need to update the ESBuild configuration with the external property to exclude elysia and @elysiajs/static from the compilation, I opted to directly update the dist/**/server.mjs file:
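What I added to server.mjs looks roughly like the following (a sketch based on Elysia's documented API; the static-plugin options and the route pattern are assumptions, and browserDistFolder, indexHtml, commonEngine, and bootstrap are the same values used earlier):
// Patched directly into dist/**/server.mjs (sketch).
import { Elysia } from 'elysia';
import { staticPlugin } from '@elysiajs/static';

const app = new Elysia()
  // Serve pre-rendered and static files from the browser output folder.
  .use(staticPlugin({ assets: browserDistFolder, prefix: '/' }))
  // Fall back to server-side rendering for everything else.
  .get('/*', async ({ request }) => {
    const html = await commonEngine.render({
      bootstrap,
      documentFilePath: indexHtml,
      url: request.url,
      publicPath: browserDistFolder,
      providers: [{ provide: APP_BASE_HREF, useValue: '' }],
    });
    return new Response(html, { headers: { 'Content-Type': 'text/html' } });
  })
  .listen(4200);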
Running the benchmark:
$ autocannon -c 100 -d 3 http://localhost:4200
Running 3s test @ http://localhost:4200
100 connections
┌─────────┬──────┬──────┬───────┬───────┬─────────┬─────────┬───────┐
│ Stat │ 2.5% │ 50% │ 97.5% │ 99% │ Avg │ Stdev │ Max │
├─────────┼──────┼──────┼───────┼───────┼─────────┼─────────┼───────┤
│ Latency │ 9 ms │ 9 ms │ 13 ms │ 20 ms │ 9.97 ms │ 3.87 ms │ 81 ms │
└─────────┴──────┴──────┴───────┴───────┴─────────┴─────────┴───────┘
┌───────────┬────────┬────────┬────────┬────────┬─────────┬─────────┬────────┐
│ Stat │ 1% │ 2.5% │ 50% │ 97.5% │ Avg │ Stdev │ Min │
├───────────┼────────┼────────┼────────┼────────┼─────────┼─────────┼────────┤
│ Req/Sec │ 8983 │ 8983 │ 9783 │ 9831 │ 9529.34 │ 388.94 │ 8981 │
├───────────┼────────┼────────┼────────┼────────┼─────────┼─────────┼────────┤
│ Bytes/Sec │ 678 MB │ 678 MB │ 738 MB │ 742 MB │ 719 MB │ 29.3 MB │ 678 MB │
└───────────┴────────┴────────┴────────┴────────┴─────────┴─────────┴────────┘
Req/Bytes counts sampled once per second.
# of samples: 3
29k requests in 3.02s, 2.16 GB read
This is faster than our previous benchmarks, even those where we made all the changes to external packages.
Conclusion
The Express framework has been a "protagonist" in the Node.js ecosystem for many years. Node.js itself, as a runtime, is a reliable and robust technology, especially compared to emerging runtimes like Bun. While Node.js is tied to the V8 engine and Bun to the JavaScriptCore (JSC) engine, Bun may appear to be a less consistent runtime, because external libraries might use the V8 API directly (rather than N-API), tying them exclusively to the Node.js execution environment. Despite this, I see great potential for Bun in the future. Whether to use Bun or not ultimately depends on your preference.
The code can be found at: https://github.com/arturovt/ssr-perf.