<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrei Pechkurov</title>
    <description>The latest articles on DEV Community by Andrei Pechkurov (@puzpuzpuz).</description>
    <link>https://dev.to/puzpuzpuz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F148501%2F52b14223-8d16-4147-a9fa-dd7082903ec2.jpg</url>
      <title>DEV Community: Andrei Pechkurov</title>
      <link>https://dev.to/puzpuzpuz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/puzpuzpuz"/>
    <language>en</language>
    <item>
      <title>[V8 Deep Dives] Random Thoughts on Math.random()</title>
      <dc:creator>Andrei Pechkurov</dc:creator>
      <pubDate>Fri, 02 Apr 2021 06:33:56 +0000</pubDate>
      <link>https://dev.to/puzpuzpuz/v8-deep-dives-random-thoughts-on-math-random-2ci4</link>
      <guid>https://dev.to/puzpuzpuz/v8-deep-dives-random-thoughts-on-math-random-2ci4</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jhggB3Ab--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2ApkXDSu4_nw0_6iBY" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jhggB3Ab--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2ApkXDSu4_nw0_6iBY" alt="" width="800" height="530"&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@nika_benedictova?utm_source=medium&amp;amp;utm_medium=referral"&gt;Nika Benedictova&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In previous parts of this series, we discussed the internals of &lt;a href="https://itnext.io/v8-deep-dives-understanding-map-internals-45eb94a183df"&gt;ES6 collections&lt;/a&gt; and &lt;a href="https://itnext.io/v8-deep-dives-understanding-array-internals-5b17d7a28ecc"&gt;arrays&lt;/a&gt; in V8. This time we will cover a simpler topic: the Math.random() function.&lt;/p&gt;

&lt;p&gt;Every JS developer uses Math.random() once in a while in their applications for various use cases. The general wisdom says that Math.random() is good for anything but security. That is, the function is not backed by a &lt;a href="https://en.wikipedia.org/wiki/Cryptographically_secure_pseudorandom_number_generator"&gt;CSPRNG&lt;/a&gt; (cryptographically secure pseudorandom number generator) and shouldn’t be used for security-related tasks, like UUID v4 generation (note: if you &lt;a href="https://security.stackexchange.com/a/157277"&gt;dare&lt;/a&gt; to use UUIDs for such tasks).&lt;/p&gt;

&lt;p&gt;Today we’ll try to understand how exactly V8 implements the Math.random() function and then match our findings against the general wisdom.&lt;/p&gt;

&lt;p&gt;TL;DR fans may want to jump to the last section of the blog post, which contains a summary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer.&lt;/strong&gt; What’s written below are implementation details specific to V8 9.0 bundled with a recent dev version of Node.js (&lt;a href="https://github.com/nodejs/node/commit/52f9aafeab02390ff78447a390651ec6ed94d166"&gt;commit 52f9aaf&lt;/a&gt; to be more precise). As usual, you should not expect any behavior beyond the spec, as implementation details are subject to change in any V8 version.&lt;/p&gt;

&lt;h4&gt;
  
  
  Spec All the Things
&lt;/h4&gt;

&lt;p&gt;Before looking at the code, let’s see what ECMAScript 2020 specification &lt;a href="https://262.ecma-international.org/11.0/#sec-math.random"&gt;says&lt;/a&gt; about Math.random() function:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Returns a Number value with positive sign, greater than or equal to 0 but less than 1, chosen randomly or pseudo randomly with approximately uniform distribution over that range, using an implementation-dependent algorithm or strategy. This function takes no arguments.&lt;/p&gt;

&lt;p&gt;Each Math.random function created for distinct realms must produce a distinct sequence of values from successive calls.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Ehmm, that’s not much. It appears that the spec leaves a lot of freedom to implementers, such as JS engines, and keeps security-related aspects out of scope.&lt;/p&gt;

&lt;p&gt;No luck with the spec, so now, with a clean conscience, we can dive into the V8 source code.&lt;/p&gt;

&lt;h4&gt;
  
  
  The Nitty-gritty Details
&lt;/h4&gt;

&lt;p&gt;Our journey starts from the Math.random() &lt;a href="https://github.com/nodejs/node/blob/52f9aafeab02390ff78447a390651ec6ed94d166/deps/v8/src/builtins/math.tq#L439"&gt;code&lt;/a&gt; written in &lt;a href="https://v8.dev/docs/torque"&gt;Torque language&lt;/a&gt;:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;We can see that Math.random() (MathRandom here) calls the RefillMathRandom macro defined elsewhere (see the extern macro declaration). We’ll see what this macro does a bit later.&lt;/p&gt;

&lt;p&gt;Next, we see that the value (random number) is not generated directly, but instead returned from a fixed-size array (the array variable). Let’s call this array the “entropy pool” (or simply “pool”) to make it recognizable through the rest of the text. The index (the newSmiIndex integer) is decremented on each call; when it reaches zero, the RefillMathRandom macro gets called, which intuitively should refill the pool, though we can’t be sure of that yet.&lt;/p&gt;
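&lt;p&gt;To make the pool mechanics concrete, here is a simplified JS sketch of the logic described above (the names are hypothetical and the real implementation is written in Torque, so treat this as an illustration only):&lt;/p&gt;

```javascript
// Simplified sketch of V8's entropy pool logic (hypothetical names).
const kCacheSize = 64;

function makeMathRandom(refill) {
  const pool = new Float64Array(kCacheSize);
  let index = 0; // zero forces a refill on the first call

  return function mathRandom() {
    if (index === 0) {
      refill(pool);       // corresponds to the RefillMathRandom macro
      index = kCacheSize; // the pool is full again
    }
    index--; // consume one pooled value per call
    return pool[index];
  };
}

// Usage with a stand-in refill function:
const mathRandom = makeMathRandom((pool) => {
  for (let i = 0; i < pool.length; i++) pool[i] = Math.random();
});
console.log(mathRandom() >= 0 && mathRandom() < 1); // true
```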

&lt;p&gt;The RefillMathRandom macro is &lt;a href="https://github.com/nodejs/node/blob/52f9aafeab02390ff78447a390651ec6ed94d166/deps/v8/src/codegen/code-stub-assembler.cc#L13856"&gt;defined&lt;/a&gt; in the CodeStubAssembler C++ class and does not contain anything spectacular. It simply calls the MathRandom::RefillCache method through an &lt;a href="https://github.com/nodejs/node/blob/52f9aafeab02390ff78447a390651ec6ed94d166/deps/v8/src/codegen/external-reference.cc#L771"&gt;external reference&lt;/a&gt;. Hence, &lt;a href="https://github.com/nodejs/node/blob/52f9aafeab02390ff78447a390651ec6ed94d166/deps/v8/src/numbers/math-random.cc#L35"&gt;the code&lt;/a&gt; we expect to refill the entropy pool is written in C++ and looks more or less like the following:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The above code is trimmed and simplified for readability purposes. As we expected, its overall logic is to generate and refill the entropy pool (the cache array). But there are a couple of other interesting details here.&lt;/p&gt;

&lt;p&gt;First of all, block #1 of the snippet describes the initialization of the seed used in subsequent number generation. This block runs only once and uses the PRNG available in the current V8 &lt;a href="https://v8.dev/blog/embedded-builtins#isolate--and-process-independent-code"&gt;isolate&lt;/a&gt; to generate the seed. Then it calculates &lt;a href="https://en.wikipedia.org/wiki/MurmurHash"&gt;murmur3&lt;/a&gt; hash codes based on the seed and stores them in the initial state.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/nodejs/node/blob/52f9aafeab02390ff78447a390651ec6ed94d166/deps/v8/src/base/utils/random-number-generator.cc#L32"&gt;PRNG&lt;/a&gt; can be supplied by embedders, such as Node.js or the Chromium browser. If a PRNG is not supplied by the embedder, V8 falls back to a system-dependent source of randomness, like &lt;code&gt;/dev/urandom&lt;/code&gt; on Linux.&lt;/p&gt;

&lt;p&gt;Then, block #2 uses the state struct to generate and fill all kCacheSize values in the pool with a &lt;a href="https://en.wikipedia.org/wiki/Xorshift"&gt;xorshift&lt;/a&gt; random number generator. The size of the pool is 64, i.e. the pool needs to be refilled after every 64 Math.random() calls.&lt;/p&gt;

&lt;p&gt;Our takeaways here are the following. First, although the initial seed used by the Math.random() function may be generated with a cryptographically secure PRNG (note: that depends on the embedder and/or OS), the subsequent number generation doesn’t involve this PRNG. Instead, it uses xorshift128+, a fast random number generator algorithm that is not cryptographically secure. Thus, we have confirmed the general wisdom: indeed, V8’s implementation of Math.random() is not meant to be used for security-sensitive tasks.&lt;/p&gt;
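&lt;p&gt;For illustration, xorshift128+ and the bit trick used to turn the 64-bit state into a double in [0, 1) can be re-implemented in a few lines of JS (an educational sketch using BigInt, not V8’s actual C++ code):&lt;/p&gt;

```javascript
// Educational JS re-implementation of xorshift128+ using BigInt.
const MASK64 = (1n << 64n) - 1n;

function xorShift128(state) {
  let s1 = state[0];
  const s0 = state[1];
  state[0] = s0;
  s1 ^= (s1 << 23n) & MASK64; // keep the state within 64 bits
  s1 ^= s1 >> 17n;
  s1 ^= s0;
  s1 ^= s0 >> 26n;
  state[1] = s1;
}

// Map the 64-bit state to a double in [0, 1): put 52 random bits into
// the mantissa of a number in [1, 2) and subtract 1.
function toDouble(state0) {
  const bits = (state0 >> 12n) | 0x3ff0000000000000n;
  const view = new DataView(new ArrayBuffer(8));
  view.setBigUint64(0, bits);
  return view.getFloat64(0) - 1;
}

// The same seed always yields the same sequence (determinism).
const state = [42n, 4242n];
xorShift128(state);
console.log(toDouble(state[0])); // a deterministic value in [0, 1)
```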

&lt;p&gt;Second, it also means that, given the same initial seed value, the generated number sequence is deterministic. Luckily, V8 supports the --random_seed flag to override the initial seed, so let’s see if our thinking is correct.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As expected, when we used 42 as the seed value in two separate Node.js &lt;a href="https://nodejs.org/api/repl.html"&gt;REPL&lt;/a&gt; sessions, Math.random() produced exactly the same sequence of numbers both times.&lt;/p&gt;

&lt;p&gt;Now that we have a better understanding of the implementation, let’s look at the performance aspect of the entropy pool.&lt;/p&gt;

&lt;h4&gt;
  
  
  Some Silly Benchmarks
&lt;/h4&gt;

&lt;p&gt;Before we go any further, I need to warn you that the following microbenchmarks are totally non-scientific, unfair benchmarks, so take them with a grain of salt. Benchmarks were done on my dev machine with an i5-8400H CPU, Ubuntu 20.04, and Node.js v16.0.0-pre (commit 52f9aaf).&lt;/p&gt;

&lt;p&gt;Our microbenchmark is terribly simple this time:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;When run, it calls Math.random() in a loop and outputs the resulting throughput.&lt;/p&gt;
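&lt;p&gt;The benchmark can be sketched roughly like this (an assumed shape, not the exact code used for the measurements):&lt;/p&gt;

```javascript
// Minimal Math.random() throughput microbenchmark (sketch).
function benchMathRandom(iterations) {
  let sum = 0; // accumulate to keep the JIT from eliminating the calls
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) {
    sum += Math.random();
  }
  const elapsedNs = Number(process.hrtime.bigint() - start);
  const opsPerSec = (iterations / elapsedNs) * 1e9;
  console.log(`${opsPerSec.toFixed(0)} ops/sec (checksum: ${sum})`);
  return opsPerSec;
}

benchMathRandom(10_000_000);
```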

&lt;p&gt;Armed with the benchmark, we’re going to compare kCacheSize=64 (the default) and kCacheSize=1 (no pool) builds of Node.js. Here is the measured result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q4lEPRt7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2A5lD1N_-ZAzg3x4mDRacLTA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q4lEPRt7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2A5lD1N_-ZAzg3x4mDRacLTA.png" alt="" width="800" height="463"&gt;&lt;/a&gt;Math.random() benchmark (with and w/o entropy pool).&lt;/p&gt;

&lt;p&gt;The benchmark shows that removing the pool makes Math.random() 22% slower. The difference is relatively small, yet the pool improves throughput by removing the overhead of a JS-to-C++ switch on each Math.random() call. Interestingly, the &lt;a href="https://github.com/uuidjs/uuid/pull/513"&gt;uuid&lt;/a&gt; npm package and, later, the &lt;a href="https://github.com/nodejs/node/pull/36729"&gt;crypto.randomUUID()&lt;/a&gt; standard function from Node.js also employ a similar entropy-pool approach (note: the difference is that they use a CSPRNG, and the performance boost is much more significant).&lt;/p&gt;

&lt;p&gt;It’s time to wrap up and summarize our findings.&lt;/p&gt;

&lt;h4&gt;
  
  
  Summary
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;As &lt;a href="http://jonasnick.github.io/blog/2015/07/08/exploiting-csgojackpots-weak-rng/"&gt;every&lt;/a&gt; &lt;a href="http://ifsec.blogspot.com/2012/05/cross-domain-mathrandom-prediction.html"&gt;JS&lt;/a&gt; &lt;a href="http://ifsec.blogspot.com/2012/09/of-html5-security-cross-domain.html"&gt;developer&lt;/a&gt; &lt;a href="https://medium.com/@betable/tifu-by-using-math-random-f1c308c4fd9d"&gt;knows&lt;/a&gt;, it’s a bad idea to use Math.random() for security-related tasks. In browsers you can use the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Web_Crypto_API"&gt;Web Crypto API&lt;/a&gt;, and Node.js users should go with the &lt;a href="https://nodejs.org/api/crypto.html"&gt;crypto&lt;/a&gt; module.&lt;/li&gt;
&lt;li&gt;The initial seed used by Math.random() is generated with the PRNG supplied by the embedder (say, Node.js or a browser), falling back to an OS-dependent source of randomness, which is not necessarily a secure one.&lt;/li&gt;
&lt;li&gt;Once the initial seed value is generated, later values are generated deterministically with the xorshift128+ algorithm and stored in a pool of 64 items, which is refilled when necessary. Determinism here means that, given the same initial seed value, Math.random() returns the same number sequence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks for reading this post. Let me know if you have ideas for the next posts in the V8 Deep Dives series. Feedback on inconsistencies or incorrect assumptions is also more than welcome.&lt;/p&gt;

</description>
      <category>node</category>
      <category>v8engine</category>
      <category>javascript</category>
    </item>
    <item>
      <title>[V8 Deep Dives] Understanding Array Internals</title>
      <dc:creator>Andrei Pechkurov</dc:creator>
      <pubDate>Mon, 15 Mar 2021 16:21:21 +0000</pubDate>
      <link>https://dev.to/puzpuzpuz/v8-deep-dives-understanding-array-internals-3jdf</link>
      <guid>https://dev.to/puzpuzpuz/v8-deep-dives-understanding-array-internals-3jdf</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AD4kbOK4Q7catQuwa" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AD4kbOK4Q7catQuwa"&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@angarav?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Antonio Garcia&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://itnext.io/v8-deep-dives-understanding-map-internals-45eb94a183df" rel="noopener noreferrer"&gt;the previous part&lt;/a&gt; of this series, we discussed Map and Set, the standard collections introduced in ES6. This time we will focus on JavaScript arrays.&lt;/p&gt;

&lt;p&gt;Arrays, which are essentially list-like objects, are one of the core features of the language, and every JavaScript developer has solid experience working with them. This blog post does not try to explain the public API; instead, it aims to briefly go through various aspects of V8’s internal implementation of JS arrays that seem noteworthy to me: memory layout, size restrictions, and other interesting implementation details.&lt;/p&gt;

&lt;p&gt;To keep things simple, the rest of the blog post assumes that V8 is running on a 64-bit system.&lt;/p&gt;

&lt;p&gt;TL;DR fans may want to jump to the last section of the blog post, which contains a summary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer.&lt;/strong&gt; What’s written below are implementation details specific to V8 8.9 bundled with a recent dev version of Node.js (&lt;a href="https://github.com/nodejs/node/commit/49342fe6f2ca6cedd5219d835a0a810e6f03cdd7" rel="noopener noreferrer"&gt;commit 49342fe&lt;/a&gt; to be more precise). As usual, you should not expect any behavior beyond the spec, as implementation details are subject to change in any V8 version.&lt;/p&gt;

&lt;h3&gt;
  
  
  Once Upon a Time in a REPL
&lt;/h3&gt;

&lt;p&gt;You may ask yourself: what could be simpler than a JavaScript array? Surely it is backed by a fixed-size array, i.e. a contiguous chunk of memory, and all operations are straightforward manipulations of the data stored in that underlying array. But as we will see, the reality is a bit more complicated than that.&lt;/p&gt;

&lt;p&gt;To make things more practical, we will observe internal transformations of an array in a &lt;a href="https://nodejs.org/api/repl.html" rel="noopener noreferrer"&gt;Node.js REPL&lt;/a&gt;. Fewer words, more code, so let’s run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ node — allow-natives-syntax

Welcome to Node.js v16.0.0-pre.

Type ".help" for more information.

&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are using the --allow-natives-syntax flag to be able to use the %DebugPrint() V8 function. This function prints internal debug information for a given object or primitive value.&lt;/p&gt;

&lt;p&gt;Now let’s create an empty array and print its debug information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; const arr = [];
undefined
&amp;gt; %DebugPrint(arr);
DebugPrint: 0x3db6370d4e51: [JSArray]
 - map: 0x3de594a433f9 &amp;lt;Map(PACKED_SMI_ELEMENTS)&amp;gt; [FastProperties]
 - prototype: 0x3a5538d05849 &amp;lt;JSArray[0]&amp;gt;
 - elements: 0x357222481309 &amp;lt;FixedArray[0]&amp;gt; [PACKED_SMI_ELEMENTS]
 - length: 0
 - properties: 0x357222481309 &amp;lt;FixedArray[0]&amp;gt;
 - All own properties (excluding elements): {
    0x357222484909: [String] in ReadOnlySpace: #length: 0x0f4cc91c1189 &amp;lt;AccessorInfo&amp;gt; (const accessor descriptor), location: descriptor
 }
...

[]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The original output is quite lengthy, so I trimmed it. What we’re interested in is the - elements: ... [PACKED_SMI_ELEMENTS] part of the output. It tells us that our array uses a &lt;a href="https://github.com/nodejs/node/blob/49342fe6f2ca6cedd5219d835a0a810e6f03cdd7/deps/v8/src/objects/fixed-array.h#L99" rel="noopener noreferrer"&gt;fixed-size array&lt;/a&gt; to store its data (V8 uses the term “backing store” for this), just as we expected. The size of that array is zero.&lt;/p&gt;

&lt;p&gt;The debug print also tells us that our JS array has the PACKED_SMI_ELEMENTS elements kind. An elements kind is metadata tracked by V8 to optimize array operations. It describes the types of elements stored in the array. If you’re not familiar with the concept, you should read this great &lt;a href="https://v8.dev/blog/elements-kinds" rel="noopener noreferrer"&gt;blog post&lt;/a&gt; from the V8 team.&lt;/p&gt;

&lt;p&gt;PACKED_SMI_ELEMENTS is the most specific elements kind; it means that all items in the array are &lt;a href="https://medium.com/fhinkel/v8-internals-how-small-is-a-small-integer-e0badc18b6da" rel="noopener noreferrer"&gt;Smis&lt;/a&gt;, small integers in the -2³¹ to 2³¹-1 range. Based on this metadata, V8 can avoid unnecessary checks and value conversions when dealing with the array. Another important aspect is the following: when a JS array is modified, its elements kind may transition from a more specific kind to a less specific one, but never the other way around. For instance, if an array’s elements kind changes from PACKED_SMI_ELEMENTS to something else due to an insertion, there is no way back to the original (more specific) kind for that particular array instance.&lt;/p&gt;

&lt;p&gt;To see how the internal array grows, we’re going to add its first element, a small integer number:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; arr.push(42);
&amp;gt; %DebugPrint(arr);
DebugPrint: 0xe61bd5eb321: [JSArray] in OldSpace
...
 - elements: 0x0e61bd5e7501 &amp;lt;FixedArray[17]&amp;gt; [PACKED_SMI_ELEMENTS]
 - length: 1
...
 - elements: 0x0e61bd5e7501 &amp;lt;FixedArray[17]&amp;gt; {
           0: 42
        1-16: 0x357222481669 &amp;lt;the_hole&amp;gt;
 }
...

[42]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we see that the internal array used as the backing store has changed. The new array has the same PACKED_SMI_ELEMENTS elements kind but a different address, and its size is 17. On our 64-bit system, this means it takes 17 * 8 = 136 bytes of memory (for the sake of simplicity, we ignore object headers). It also means that the allocated internal array is bigger than what we requested. This allows V8 to achieve constant &lt;a href="https://en.wikipedia.org/wiki/Amortized_analysis#Dynamic_array" rel="noopener noreferrer"&gt;amortized time&lt;/a&gt; for push() and similar operations that grow the array. The following &lt;a href="https://github.com/nodejs/node/blob/49342fe6f2ca6cedd5219d835a0a810e6f03cdd7/deps/v8/src/objects/js-objects.h#L542" rel="noopener noreferrer"&gt;formula&lt;/a&gt; is used to determine the new size when the internal array is not big enough:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;new_capacity = (old_capacity + 50%) + 16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, old_capacity stands for the old internal array size plus the number of inserted items, so in our case it’s equal to 1, and new_capacity is calculated as 1 + 0 + 16 = 17 (the 50% term rounds down to zero).&lt;/p&gt;
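&lt;p&gt;The formula is easy to re-implement in JS to double-check the numbers (assuming the 50% term rounds down, which matches the REPL output above):&lt;/p&gt;

```javascript
// JS re-implementation of V8's backing store growth formula:
// new_capacity = (old_capacity + 50%) + 16
function newCapacity(oldCapacity) {
  return oldCapacity + (oldCapacity >> 1) + 16;
}

console.log(newCapacity(1));  // 17, matching the FixedArray[17] above
console.log(newCapacity(17)); // 41
```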

&lt;p&gt;There is one more interesting detail in the above output: the 1-16: ... text in the array contents tells us that the unused part of the internal array is filled with “the hole”. The hole is a special value used by V8 to mark unassigned or deleted array items (among other things). It’s an implementation detail that never “leaks” into JS code. In our example, V8 uses the hole to initialize the unused fraction of the array.&lt;/p&gt;

&lt;p&gt;You may wonder if the internal array ever shrinks. It does, on operations that decrease the array length, such as pop() or shift(). This happens when more than &lt;a href="https://github.com/nodejs/node/blob/49342fe6f2ca6cedd5219d835a0a810e6f03cdd7/deps/v8/src/objects/elements.cc#L717" rel="noopener noreferrer"&gt;half the elements&lt;/a&gt; (with some padding for small arrays) would be unused after the operation.&lt;/p&gt;

&lt;p&gt;Returning to our REPL session, the PACKED_SMI_ELEMENTS kind assumes no holes in the array, but if we modify the array in a certain way, the kind will transition to a less specific one. Let’s do it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; arr[2] = 0;
&amp;gt; %DebugPrint(arr);
...
 - elements: 0x0e61bd5e7501 &amp;lt;FixedArray[17]&amp;gt; [HOLEY_SMI_ELEMENTS]
 - length: 3
...
 - elements: 0x0e61bd5e7501 &amp;lt;FixedArray[17]&amp;gt; {
           0: 42
           1: 0x357222481669 &amp;lt;the_hole&amp;gt;
           2: 0
        3-16: 0x357222481669 &amp;lt;the_hole&amp;gt;
 }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we assigned the element at index 2, skipping index 1, which now contains the hole. As a result, the array’s elements kind transitioned to HOLEY_SMI_ELEMENTS. This kind assumes that the array contains only Smis or holes. In terms of performance, it is slightly slower than the packed kind, as V8 has to perform value checks to skip holes when iterating over or modifying the array.&lt;/p&gt;

&lt;p&gt;We’re not going to experiment any further with other array-backed elements kinds; that is left as an exercise for curious readers. Nevertheless, it makes sense to mention that V8 optimizes arrays of 64-bit floating-point numbers: the PACKED_DOUBLE_ELEMENTS and HOLEY_DOUBLE_ELEMENTS kinds store numbers directly in the backing array, avoiding a heap pointer per number.&lt;/p&gt;

&lt;p&gt;What we’re interested in next is whether the backing store used for array items can be something other than a fixed-size array. Let’s do one more experiment in our REPL session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; arr[32 &amp;lt;&amp;lt; 20] = 0;
&amp;gt; %DebugPrint(arr);
...
 - elements: 0x10f6026db0d9 &amp;lt;NumberDictionary[16]&amp;gt; [DICTIONARY_ELEMENTS]
 - length: 33554433
...
 - elements: 0x10f6026db0d9 &amp;lt;NumberDictionary[16]&amp;gt; {
   - max_number_key: 33554432
   2: 0 (data, dict_index: 0, attrs: [WEC])
   0: 42 (data, dict_index: 0, attrs: [WEC])
   33554432: 0 (data, dict_index: 0, attrs: [WEC])
 }
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What just happened? Our array no longer uses an array-based backing store; instead, it uses a NumberDictionary, a hash table-based &lt;a href="https://github.com/nodejs/node/blob/49342fe6f2ca6cedd5219d835a0a810e6f03cdd7/deps/v8/src/objects/dictionary.h#L301" rel="noopener noreferrer"&gt;collection&lt;/a&gt; specialized for number keys. If you’re interested in further details, the hash table uses &lt;a href="https://en.wikipedia.org/wiki/Hash_table#Open_addressing" rel="noopener noreferrer"&gt;open addressing&lt;/a&gt; with quadratic probing.&lt;/p&gt;

&lt;p&gt;The elements kind also transitioned to DICTIONARY_ELEMENTS, which means the “slow” path for JS arrays. With this kind, V8 aims to reduce the memory footprint of sparse arrays with a lot of holes, as the hash table stores only non-hole array elements. On the other hand, hash table operations are slower than array accesses, as we pay the cost of hash code calculation, entry lookups, and rehashing. A bit later we’re going to do some microbenchmarking to understand the cost.&lt;/p&gt;

&lt;p&gt;The dictionary kind is used for arrays larger than 32 * 2²⁰ (~33.5M elements), which is why our array transitioned to this kind once we hit &lt;a href="https://github.com/nodejs/node/blob/49342fe6f2ca6cedd5219d835a0a810e6f03cdd7/deps/v8/src/objects/js-array.h#L120" rel="noopener noreferrer"&gt;the limit&lt;/a&gt;. In terms of memory, this means that an array-backed JS array cannot grow beyond ~268MB.&lt;/p&gt;

&lt;p&gt;As for dictionary-based arrays, their maximum size is restricted by the &lt;a href="https://262.ecma-international.org/11.0/#sec-arraycreate" rel="noopener noreferrer"&gt;ECMAScript specification&lt;/a&gt; and cannot exceed the maximum value of a 32-bit unsigned integer (2³² − 1).&lt;/p&gt;

&lt;p&gt;Great. Now that we have a better understanding of how V8 handles JS arrays, let’s do some benchmarking.&lt;/p&gt;

&lt;h3&gt;
  
  
  Some Silly Benchmarks
&lt;/h3&gt;

&lt;p&gt;Before we go any further, I need to warn you that the following microbenchmarks are totally non-scientific, unfair benchmarks, so take them with a grain of salt. Benchmarks were done on my dev machine with an i5-8400H CPU, Ubuntu 20.04, and Node.js v15.11.0.&lt;/p&gt;

&lt;p&gt;First, let’s try to understand the difference between the elements kinds in terms of array iteration. In &lt;a href="https://github.com/puzpuzpuz/microbenchmarks/blob/91672594d559f8c3dca69cc0a9241309bc6a7467/src/dictionary-array-iteration.js" rel="noopener noreferrer"&gt;the first benchmark&lt;/a&gt;, we iterate over an array of numbers and simply calculate the total sum of its elements. The results are visualized below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AXZzd8QAAZnuOYb4G" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AXZzd8QAAZnuOYb4G"&gt;&lt;/a&gt;Array iteration benchmark results.&lt;/p&gt;

&lt;p&gt;Here the result for the dictionary kind is barely visible, as it’s two orders of magnitude smaller than the one for the packed kind. The holey kind is only 23% slower than the packed one.&lt;/p&gt;
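&lt;p&gt;The first benchmark can be sketched roughly like this (an assumed shape; the linked repository contains the real code):&lt;/p&gt;

```javascript
// Sketch of the array iteration benchmark: sum the elements of a packed
// array and report the elapsed time.
function sumElements(arr) {
  let total = 0;
  for (let i = 0; i < arr.length; i++) {
    total += arr[i];
  }
  return total;
}

// Packed array: every element is assigned, no holes.
const packed = Array.from({ length: 1_000_000 }, (_, i) => i % 100);
// Deleting an element punches a hole and transitions the elements kind
// to a HOLEY_* one (an irreversible transition, as discussed above).
const holey = Array.from({ length: 1_000_000 }, (_, i) => i % 100);
delete holey[0];

const start = process.hrtime.bigint();
const total = sumElements(packed);
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`sum: ${total}, took ${elapsedMs.toFixed(2)} ms`);
```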

&lt;p&gt;Now let’s measure basic mutation operations, like push() and pop(). In &lt;a href="https://github.com/puzpuzpuz/microbenchmarks/blob/91672594d559f8c3dca69cc0a9241309bc6a7467/src/dictionary-array-push-pop.js" rel="noopener noreferrer"&gt;the second benchmark&lt;/a&gt;, we push 1K elements into the array and then pop all of them on each iteration. The results are below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Akqmdvh2nC4UfHc-i" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2Akqmdvh2nC4UfHc-i"&gt;&lt;/a&gt;Push/pop benchmark results.&lt;/p&gt;

&lt;p&gt;This time the dictionary kind result is not even visible (and, yes, I’m awful at data visualization), since it’s ~200 versus ~238K operations per second for the array-based kinds.&lt;/p&gt;

&lt;p&gt;Interestingly, if we disable JIT in V8 with the &lt;a href="https://v8.dev/blog/jitless" rel="noopener noreferrer"&gt;--jitless flag&lt;/a&gt;, the result becomes ~200 versus ~16K operations per second. This clearly shows how good V8’s JIT is at optimizing loops over array-based kinds.&lt;/p&gt;

&lt;p&gt;While the absolute numbers don’t matter much, the above results illustrate that your JS application should avoid dealing with dictionary-based arrays unless it absolutely has to.&lt;/p&gt;

&lt;p&gt;It’s time to wrap up and list today’s findings.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Each JS array is associated with an elements kind, metadata tracked by V8 to optimize array operations. The kind describes the types of elements stored in the array.&lt;/li&gt;
&lt;li&gt;Elements of small enough arrays are stored in an internal fixed-size array. V8 allocates some extra space in the internal array to achieve constant amortized time for push() and similar operations that grow the array. When the array length decreases, the internal array may also shrink.&lt;/li&gt;
&lt;li&gt;Once a JS array becomes large (this also includes holey arrays), V8 starts using a hash table to store the array elements. The array is now associated with the “slow” dictionary elements kind.&lt;/li&gt;
&lt;li&gt;For hot loops, the “slow” dictionary kind may be multiple orders of magnitude slower than the array-based kinds.&lt;/li&gt;
&lt;li&gt;V8’s JIT is good at optimizing loops over array-based kinds.&lt;/li&gt;
&lt;li&gt;In general, when writing code that manipulates large arrays on the hot path, you should let V8 use the most specific elements kind for your arrays.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks for reading this post. Please let me know if you have ideas for the next posts in the V8 Deep Dives series. Feedback on inconsistencies or incorrect assumptions is also more than welcome.&lt;/p&gt;

</description>
      <category>v8engine</category>
      <category>node</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Our Journey to a High-Performance Node.js Library</title>
      <dc:creator>Andrei Pechkurov</dc:creator>
      <pubDate>Thu, 19 Nov 2020 06:47:15 +0000</pubDate>
      <link>https://dev.to/hazelcast/our-journey-to-a-high-performance-node-js-library-58dm</link>
      <guid>https://dev.to/hazelcast/our-journey-to-a-high-performance-node-js-library-58dm</guid>
      <description>&lt;p&gt;As you may already know, the Hazelcast &lt;a href="https://hazelcast.org/imdg/" rel="noopener noreferrer"&gt;In-Memory Data Grid&lt;/a&gt; (IMDG) ecosystem includes a &lt;a href="https://hazelcast.org/imdg/clients-languages/" rel="noopener noreferrer"&gt;variety of clients&lt;/a&gt; for different languages and runtimes, which includes Node.js &lt;a href="https://github.com/hazelcast/hazelcast-nodejs-client" rel="noopener noreferrer"&gt;client library&lt;/a&gt; as a part of that list.&lt;/p&gt;

&lt;p&gt;You can use Hazelcast clients in various scenarios, including, but not limited to, the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building a multi-layer cache for your applications with &lt;a href="https://github.com/hazelcast/hazelcast-nodejs-client/blob/9e53459a15c479eed7ef4df2752c833a9c311d53/DOCUMENTATION.md#741-using-map" rel="noopener noreferrer"&gt;IMap&lt;/a&gt;, a distributed, replicated key-value store, and its &lt;a href="https://github.com/hazelcast/hazelcast-nodejs-client/blob/9e53459a15c479eed7ef4df2752c833a9c311d53/DOCUMENTATION.md#782-near-cache" rel="noopener noreferrer"&gt;NearCache&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Enabling pub-sub communication between application instances.&lt;/li&gt;
&lt;li&gt;Dealing with high load for views or likes events by using a &lt;a href="https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type" rel="noopener noreferrer"&gt;conflict-free replicated&lt;/a&gt; counter.&lt;/li&gt;
&lt;li&gt;Preventing races when accessing 3rd-party services by using &lt;a href="https://hazelcast.org/blog/long-live-distributed-locks/" rel="noopener noreferrer"&gt;FencedLock&lt;/a&gt; and other distributed concurrency primitives available in Hazelcast CP Subsystem (powered by &lt;a href="https://raft.github.io/" rel="noopener noreferrer"&gt;Raft&lt;/a&gt; consensus algorithm).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;High performance and low latency for data access have always been key features of Hazelcast. So, it’s not surprising that we put a lot of time and effort into optimizing both the server side and the client libraries.&lt;/p&gt;

&lt;p&gt;Our Node.js library went through numerous performance analysis and optimization runs over the course of several releases, and we think it’s worth telling you the story and sharing the gathered experience. If you develop a library or an application for Node.js and performance is something you care about, you may find this blog post valuable.&lt;/p&gt;

&lt;p&gt;TL;DR&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Performance analysis is not a one-time action but rather a (sometimes tiring) process.&lt;/li&gt;
&lt;li&gt;Node.js core and the ecosystem include useful tools, like the built-in profiler, to help you with the analysis.&lt;/li&gt;
&lt;li&gt;Be prepared for the fact that you will have to throw many (if not most) of your experiments into the trash as part of the optimization process.&lt;/li&gt;
&lt;li&gt;While the “high-performance library” title may sound too loud, we do our best to live up to it for Node.js and all the other Hazelcast client libraries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re going to start this story in spring 2019, back in the days of version 0.10.0 of the Node.js client. At that point, the library was more or less feature-complete, but there was little understanding of its performance. Obviously, the performance had to be analyzed before the first non-0.x release of the client, and that’s where this story starts.&lt;/p&gt;

&lt;h2&gt;Benchmarks&lt;/h2&gt;

&lt;p&gt;It’s not a big secret that benchmarking is tricky. Even VMs themselves may introduce noticeable variation in results and may even &lt;a href="https://tratt.net/laurie/blog/entries/why_arent_more_users_more_happy_with_our_vms_part_1.html" rel="noopener noreferrer"&gt;fail&lt;/a&gt; to reach a steady performance state. Add Node.js, the library, and the benchmark code on top of that, and the goal of reliable benchmarking gets even harder. Any performance analysis has to rely on inputs provided by some kind of benchmark. Luckily, version 0.10.0 of the library included a simple &lt;a href="https://github.com/hazelcast/hazelcast-nodejs-client/blob/v0.10.0/benchmark/SimpleMapBenchmark.js" rel="noopener noreferrer"&gt;benchmark&lt;/a&gt; used in early development phases. That benchmark had some limitations which needed to be resolved before going any further.&lt;/p&gt;

&lt;p&gt;The existing benchmark supported only a single scenario with randomly chosen operations. There is nothing wrong with having a random-based scenario in the benchmark suite, but only when narrower scenarios are also present in the suite. In the case of a client library, those would be “read-heavy” and “write-heavy” scenarios. The first assumes sending lots of read operations, thus moving the hot path to the I/O read-from-socket code and subsequent data deserialization. You may have already guessed that the second scenario involves lots of writes and moves write-to-socket and serialization code to the hot path. So, we added these additional scenarios.&lt;/p&gt;

&lt;p&gt;Another noticeable addition to scenarios was support for the payload size option. Variation in payload size is important when running benchmarks, as it helps with finding potential bottlenecks in the serialization code. Using different payload types is also valuable, but for a start, we decided to deal with strings only. String type is used for storing JSON data on the Hazelcast cluster, so our choice had a nice side-effect of testing a significant part of the hot path for JSON payload type (i.e., for plain JavaScript objects).&lt;/p&gt;

&lt;p&gt;The second problem was self-throttling of the benchmark. Simply put, the benchmark itself was acting as a bottleneck, hiding the real bottlenecks present in the client library. Each subsequent operation run by the benchmark was &lt;a href="https://github.com/hazelcast/hazelcast-nodejs-client/blob/v0.10.0/benchmark/SimpleMapBenchmark.js#L34" rel="noopener noreferrer"&gt;scheduled&lt;/a&gt; with the setImmediate() function without any concurrency limit on the operations being sent. Apart from becoming a bottleneck, this approach also created a significant level of noise (sometimes called “jitter”) in the benchmark results. Even worse, such logic puts the benchmark very far from real-world Node.js applications.&lt;/p&gt;

&lt;p&gt;That’s why we improved the benchmark by enforcing the given concurrency limit. The end behavior of our &lt;a href="https://github.com/hazelcast/hazelcast-nodejs-client/blob/353edf80b5bf86676f5b9d025b35c630d8cfbd03/benchmark/BenchmarkRunner.js#L20" rel="noopener noreferrer"&gt;benchmark runner&lt;/a&gt; is close to the popular &lt;a href="https://www.npmjs.com/package/p-limit" rel="noopener noreferrer"&gt;p-limit&lt;/a&gt; package and can be visualized as the following diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2i4m8hh5ojha33fz0sy8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2i4m8hh5ojha33fz0sy8.png" alt="New benchmark logic" width="800" height="367"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram shows how operations are executed when the concurrency limit is set to 3 and the total count of operations to be run is 7. As a result, the load put on both the client and the server-side instances is evenly distributed, which helps to minimize the jitter.&lt;/p&gt;
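&lt;p&gt;The runner logic sketched in the diagram can be illustrated with the following simplified code (a hypothetical sketch, not the actual benchmark implementation):&lt;/p&gt;

```javascript
// Run async operations with at most `limit` of them in flight at once.
// Each "worker" picks the next pending operation as soon as its current
// one settles, keeping the load evenly distributed.
async function runWithLimit(operations, limit) {
  let next = 0; // index of the next operation to start

  async function worker() {
    while (next < operations.length) {
      const op = operations[next++];
      await op();
    }
  }

  const workers = [];
  for (let i = 0; i < Math.min(limit, operations.length); i++) {
    workers.push(worker());
  }
  await Promise.all(workers);
}
```

&lt;p&gt;With 7 operations and a limit of 3, this behaves exactly as in the diagram: 3 operations are in flight at any moment until the total count is exhausted.&lt;/p&gt;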

&lt;p&gt;Finally, we added a warm-up phase into the benchmark to give both client and server VMs some time to reach a steady state.&lt;/p&gt;

&lt;p&gt;Now, with our new shiny benchmark, we were ready to start the actual analysis.&lt;/p&gt;

&lt;h2&gt;Here Come the Bottlenecks&lt;/h2&gt;

&lt;p&gt;The very first benchmark run showed the following results in scenarios based on IMap’s get() (“read-heavy”) and set() (“write-heavy”) operations.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scenario&lt;/td&gt;
&lt;td&gt;get() 3B&lt;/td&gt;
&lt;td&gt;get() 1KB&lt;/td&gt;
&lt;td&gt;get() 100KB&lt;/td&gt;
&lt;td&gt;set() 3B&lt;/td&gt;
&lt;td&gt;set() 1KB&lt;/td&gt;
&lt;td&gt;set() 100KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput (ops/sec)&lt;/td&gt;
&lt;td&gt;90,933&lt;/td&gt;
&lt;td&gt;23,591&lt;/td&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;76,011&lt;/td&gt;
&lt;td&gt;44,324&lt;/td&gt;
&lt;td&gt;1,558&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each result here stands for an average throughput calculated over a number of benchmark runs. Result variation, median and outliers are omitted for the sake of brevity, but they were also considered when comparing results.&lt;/p&gt;

&lt;p&gt;Data sizes (3B, 1KB, and 100KB) in the table stand for the value size. Of course, absolute numbers are not important here, as we didn’t yet have a baseline. Still, the results for the smallest value size look more or less solid and, had we only run these benchmarks, we could have stopped the analysis, given the library a green light for the first major release, and arranged the release party. But the results for larger values are much more disturbing. Throughput scales down almost linearly as the value size grows, which doesn’t look good. This gave us a clue that there was a bottleneck somewhere on the hot path, presumably in the serialization code. Further analysis was required.&lt;/p&gt;

&lt;p&gt;Node.js is quite mature, and there are a number of tools in the ecosystem to help you find bottlenecks. The first one is V8’s sampling profiler, &lt;a href="https://nodejs.org/en/docs/guides/simple-profiling/" rel="noopener noreferrer"&gt;exposed&lt;/a&gt; by Node.js core. It collects information about call stacks in your application at a constant time interval and stores it in an intermediate profile file. It then allows you to prepare a text report based on the profile. The core logic is simple: the more samples that have a function on top of the call stack, the more time was spent in that function during profiling. Thus, potential bottlenecks are usually found among the most “heavy” functions.&lt;/p&gt;

&lt;p&gt;Profiler reports are helpful in many situations, but sometimes you may want to start the analysis with visual information. Fortunately, &lt;a href="http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html" rel="noopener noreferrer"&gt;flame graphs&lt;/a&gt; are there to help. There are a number of ways to collect flame graphs for Node.js applications, but we were more than fine with &lt;a href="https://www.npmjs.com/package/0x" rel="noopener noreferrer"&gt;0x library&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here is a screenshot of the flame graph collected for the set() 3B scenario.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjkgy97xjwfi6rys6o320.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjkgy97xjwfi6rys6o320.png" alt="Flame graph for set() 3B scenario" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This screenshot is static, while 0x produces an interactive web page that allows you to zoom and filter through the contents of the flame graph. In this particular case, it took us some time to iterate over the so-called “plateaus” in search of suspicious calls. Finally, we found a good candidate, highlighted in the next picture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpabx8ohnde7lcpbidrw6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpabx8ohnde7lcpbidrw6.png" alt="Flame graph for set() 3B scenario (bottleneck highlighted)" width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It appeared that the library was doing a lot of unnecessary allocations for &lt;a href="https://nodejs.org/api/buffer.html" rel="noopener noreferrer"&gt;Buffer&lt;/a&gt; objects. Buffers are low-level objects based on V8’s ArrayBuffer class, which represents contiguous arrays of binary data. The actual data is stored off-heap (there are some exceptions to this rule, but they are not relevant for our case), so allocating a Buffer may be a relatively expensive operation.&lt;/p&gt;
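&lt;p&gt;To give an idea of what avoiding repeated allocations may look like, here is a hedged sketch of a greedy allocation pattern (a hypothetical helper, not the library’s actual code): instead of allocating a fresh Buffer per write, keep one growable Buffer and double its capacity on demand.&lt;/p&gt;

```javascript
// Growable write buffer: allocates greedily (doubling capacity) so that
// repeated writes amortize to few Buffer allocations.
class GrowableBuffer {
  constructor(initialCapacity = 1024) {
    this.buffer = Buffer.allocUnsafe(initialCapacity);
    this.length = 0; // bytes actually written so far
  }

  write(chunk) {
    this.ensureCapacity(this.length + chunk.length);
    chunk.copy(this.buffer, this.length);
    this.length += chunk.length;
  }

  ensureCapacity(required) {
    if (required <= this.buffer.length) return;
    let capacity = this.buffer.length;
    while (capacity < required) capacity *= 2; // grow geometrically
    const bigger = Buffer.allocUnsafe(capacity);
    this.buffer.copy(bigger, 0, 0, this.length); // keep what was written
    this.buffer = bigger;
  }

  toBuffer() {
    // View over the written bytes; no copy is made here.
    return this.buffer.subarray(0, this.length);
  }
}
```

&lt;p&gt;The geometric growth is what makes the approach “greedy”: some memory is over-allocated up front so that most writes touch no allocator at all.&lt;/p&gt;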

&lt;p&gt;As a simple fix, we tried to get rid of certain Buffer allocations happening in the library by doing those allocations in a &lt;a href="https://github.com/hazelcast/hazelcast-nodejs-client/pull/473/files#diff-0f3afe789013f0c716b073253fe702593cd67adc4221cb0b1434cfd76b26cfbaR39" rel="noopener noreferrer"&gt;greedy manner&lt;/a&gt;. With this change, the benchmark showed us the following.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;get() 3B&lt;/td&gt;
&lt;td&gt;get() 1KB&lt;/td&gt;
&lt;td&gt;get() 100KB&lt;/td&gt;
&lt;td&gt;set() 3B&lt;/td&gt;
&lt;td&gt;set() 1KB&lt;/td&gt;
&lt;td&gt;set() 100KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v0.10.0&lt;/td&gt;
&lt;td&gt;90,933&lt;/td&gt;
&lt;td&gt;23,591&lt;/td&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;76,011&lt;/td&gt;
&lt;td&gt;44,324&lt;/td&gt;
&lt;td&gt;1,558&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Candidate&lt;/td&gt;
&lt;td&gt;104,854&lt;/td&gt;
&lt;td&gt;24,929&lt;/td&gt;
&lt;td&gt;109&lt;/td&gt;
&lt;td&gt;95,165&lt;/td&gt;
&lt;td&gt;52,809&lt;/td&gt;
&lt;td&gt;1,581&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+15%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+25%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+19%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The improvement was noticeable for smaller payloads, but the scalability issue was still there. While the fix was very simple, if not primitive, the very first bottleneck was found. The fix was good enough as an initial optimization, and further improvements were put into the backlog for future versions of the library.&lt;/p&gt;

&lt;p&gt;The next step was to analyze the so-called “read-heavy” scenarios. After a series of profiler runs and a thoughtful analysis, we found a suspicious call. It is highlighted in the following screenshot of the get() 100KB flame graph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9v9xnxsmt3vih8mre4nu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9v9xnxsmt3vih8mre4nu.png" alt="Flame graph for get() 100KB scenario (bottleneck highlighted)" width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ObjectDataInput.readUtf() method appeared to be executed on a significant percentage of collected profiler samples, so we started looking into that. The method was responsible for string deserialization (i.e., creating a string from the binary data) and looked more or less like the following TypeScript code.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="nf"&gt;readUTF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;len&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readInt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pos&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;len&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;charCode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;leadingByte&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readByte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;readingIndex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;MASK_1BYTE&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;readingIndex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addOrUndefined&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;readingIndex&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;leadingByte&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0xFF&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;switch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// ...&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromCharCode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;charCode&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In general, the method was similar to what we had in the Hazelcast Java client. It was reading UTF-8 chars one by one and concatenating the result string. That looked like suboptimal code, considering that Node.js provides the &lt;a href="https://nodejs.org/api/buffer.html#buffer_buf_tostring_encoding_start_end" rel="noopener noreferrer"&gt;buf.toString()&lt;/a&gt; method as part of the standard library. To compare the two implementations, we wrote simple microbenchmarks for both string &lt;a href="https://github.com/puzpuzpuz/microbenchmarks/blob/daac1f55412d0e2fd72af73dc0e68c7f86c0efed/src/utf-deserializers.js" rel="noopener noreferrer"&gt;deserialization&lt;/a&gt; and &lt;a href="https://github.com/puzpuzpuz/microbenchmarks/blob/daac1f55412d0e2fd72af73dc0e68c7f86c0efed/src/utf-serializers.js" rel="noopener noreferrer"&gt;serialization&lt;/a&gt;. Here is a trimmed result for the serialization microbenchmark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fiiwuw7q1ze3cuil9st94.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fiiwuw7q1ze3cuil9st94.png" alt="Serializers microbenchmark results" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As is clearly seen here, the standard API is significantly (around 6x) faster than our custom implementation when it comes to ASCII strings (which are the frequent case in user applications). Results for deserialization and other scenarios look similar with respect to the string size correlation. That was the exact cause of the scalability issue.&lt;/p&gt;

&lt;p&gt;The standard library is significantly faster in the ASCII string case, as V8 is smart enough to detect the case and go over the fast path where it simply copies string contents instead of decoding/encoding individual chars. For those of you who are curious about the corresponding V8 source code, here is &lt;a href="https://github.com/v8/v8/blob/lkgr/6.8/src/heap/factory.cc#L609" rel="noopener noreferrer"&gt;the place&lt;/a&gt; responsible for the buf.toString()’s fast path.&lt;/p&gt;
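&lt;p&gt;A minimal sketch in the spirit of those microbenchmarks (hypothetical code, much simpler than the linked ones, and assuming ASCII-only input) shows the two approaches side by side:&lt;/p&gt;

```javascript
// Char-by-char decoding, similar in spirit to the readUTF() above.
// For ASCII input, one byte corresponds to one char.
function manualDecode(buf) {
  let result = '';
  for (let i = 0; i < buf.length; i++) {
    result += String.fromCharCode(buf[i]);
  }
  return result;
}

const payload = Buffer.from('a'.repeat(100 * 1024)); // a 100KB ASCII string

// Time both decoders over a number of iterations.
for (const [name, fn] of [
  ['manual', () => manualDecode(payload)],
  ['toString', () => payload.toString('utf8')],
]) {
  const start = process.hrtime.bigint();
  for (let i = 0; i < 100; i++) fn();
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${name}: ${elapsedMs.toFixed(1)} ms`);
}
```

&lt;p&gt;Both functions produce identical strings for ASCII input; the difference lies entirely in where the per-char work happens (JS loop vs. V8’s internal fast path).&lt;/p&gt;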

&lt;p&gt;Anyhow, before making the final verdict, it was necessary to confirm the hypothesis with a proper experiment. To do so, we implemented a fix and compared it with the baseline (v0.10.0).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;get() 3B&lt;/td&gt;
&lt;td&gt;get() 1KB&lt;/td&gt;
&lt;td&gt;get() 100KB&lt;/td&gt;
&lt;td&gt;set() 3B&lt;/td&gt;
&lt;td&gt;set() 1KB&lt;/td&gt;
&lt;td&gt;set() 100KB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v0.10.0&lt;/td&gt;
&lt;td&gt;90,933&lt;/td&gt;
&lt;td&gt;23,591&lt;/td&gt;
&lt;td&gt;105&lt;/td&gt;
&lt;td&gt;76,011&lt;/td&gt;
&lt;td&gt;44,324&lt;/td&gt;
&lt;td&gt;1,558&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Candidate&lt;/td&gt;
&lt;td&gt;122,458&lt;/td&gt;
&lt;td&gt;104,090&lt;/td&gt;
&lt;td&gt;7,052&lt;/td&gt;
&lt;td&gt;110,083&lt;/td&gt;
&lt;td&gt;73,618&lt;/td&gt;
&lt;td&gt;8,428&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+34%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+341%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+6,616%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+45%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+66%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+440%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Bingo! Lesson learned: always bet on the standard library. Even if it’s slower today, things may change dramatically in future releases.&lt;/p&gt;

&lt;p&gt;As a result of this short (~1.5 weeks) initial analysis, Hazelcast Node.js client v3.12 was &lt;a href="https://github.com/hazelcast/hazelcast-nodejs-client/releases/tag/v3.12" rel="noopener noreferrer"&gt;released&lt;/a&gt; with both of the discussed performance improvements.&lt;/p&gt;

&lt;p&gt;Now that there is an understanding of our usual process, let’s speed up the narration and briefly describe the optimizations shipped in later versions of the library.&lt;/p&gt;

&lt;h2&gt;Automated Pipelining&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Protocol_pipelining" rel="noopener noreferrer"&gt;Protocol pipelining&lt;/a&gt; is a well-known technique used to improve the performance of blocking APIs. On the user level, it usually implies an explicit batching API, which is only applicable to a number of use cases, like &lt;a href="https://en.wikipedia.org/wiki/Extract,_transform,_load" rel="noopener noreferrer"&gt;ETL&lt;/a&gt; pipelines.&lt;/p&gt;

&lt;p&gt;Obviously, the same approach can be applied to Node.js with its non-blocking APIs. But we wanted to apply the technique in an implicit fashion so that most applications would benefit from the new optimization. We ended up with the feature called &lt;a href="https://github.com/hazelcast/hazelcast-nodejs-client/pull/481" rel="noopener noreferrer"&gt;automated pipelining&lt;/a&gt;. It can be illustrated with the following diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frlypnphhranfs5ztwo9q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frlypnphhranfs5ztwo9q.png" alt="Automated pipelining logic" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main idea is to accumulate outbound messages based on the event loop lifecycle instead of writing them into a TCP socket immediately when the user starts an operation. The messages are scheduled to be concatenated into a single Buffer (with a configured size threshold) and only then are written into the socket. This way we benefit from batch writes without having to ask the user to deal with an explicit pipelining API.&lt;/p&gt;
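&lt;p&gt;The accumulation idea can be sketched with the following simplified code (a hypothetical writer, not the actual client implementation):&lt;/p&gt;

```javascript
// Queue outbound messages and flush them as one concatenated Buffer,
// either on the next event loop tick or once a size threshold is hit.
class PipeliningWriter {
  constructor(socket, thresholdBytes = 8192) {
    this.socket = socket;
    this.thresholdBytes = thresholdBytes;
    this.queue = [];
    this.queuedBytes = 0;
    this.scheduled = false;
  }

  write(message) {
    this.queue.push(message);
    this.queuedBytes += message.length;
    if (this.queuedBytes >= this.thresholdBytes) {
      this.flush(); // batch is big enough: write right away
    } else if (!this.scheduled) {
      this.scheduled = true;
      // Accumulate everything written during the current tick.
      process.nextTick(() => this.flush());
    }
  }

  flush() {
    this.scheduled = false;
    if (this.queue.length === 0) return;
    const batch = Buffer.concat(this.queue, this.queuedBytes);
    this.queue = [];
    this.queuedBytes = 0;
    this.socket.write(batch); // one syscall-ish write per batch
  }
}
```

&lt;p&gt;All messages started by the application within one tick end up in a single socket write, which is where the batching benefit comes from.&lt;/p&gt;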

&lt;p&gt;Another important aspect here is that the client keeps one persistent connection per cluster member (note: we’re talking about smart client mode). Consequently, network communication over each connection is intensive enough to make the described batching logic valuable in terms of throughput.&lt;/p&gt;

&lt;p&gt;Hazelcast Java client &lt;a href="https://github.com/hazelcast/hazelcast/blob/97ff57ba92dc1fbfa77720885e7db40757999080/hazelcast/src/main/java/com/hazelcast/client/impl/protocol/util/ClientMessageEncoder.java" rel="noopener noreferrer"&gt;implements&lt;/a&gt; something close to this optimization by concatenating messages before writing them into the socket. A similar approach is used in other Node.js libraries, like DataStax Node.js driver for Apache Cassandra.&lt;/p&gt;

&lt;p&gt;Benchmark measurements for automated pipelining showed a 24-35% throughput improvement in read and write scenarios. The only drawback was a certain degradation (~23%) in scenarios with large message writes (100KB), which was expected considering the nature of the optimization. As real-world applications read data more frequently than they write it, we decided to enable automated pipelining by default and allow users to disable it via the client configuration.&lt;/p&gt;

&lt;p&gt;Later on, we improved automated pipelining by optimizing the code that manipulates the write queue. The main improvement came from &lt;a href="https://github.com/hazelcast/hazelcast-nodejs-client/pull/585" rel="noopener noreferrer"&gt;reusing&lt;/a&gt; the outbound Buffer instead of allocating a new one on each write. Apart from this, we were also able to &lt;a href="https://github.com/hazelcast/hazelcast-nodejs-client/pull/605" rel="noopener noreferrer"&gt;get rid&lt;/a&gt; of the remaining unnecessary Buffer allocations in the library. As a result, we got around an 8-10% throughput improvement. This latest version of automated pipelining may be found in the 4.0 release of the client.&lt;/p&gt;

&lt;h2&gt;Boomerang Backups&lt;/h2&gt;

&lt;p&gt;As you may guess, it’s not all about Node.js specific optimizations. Periodically, all Hazelcast clients get common optimizations. Client backup acknowledgments (a.k.a. boomerang backups) are a recent example of this process.&lt;/p&gt;

&lt;p&gt;Previously, the client waited for the sync backups to complete on the member, which meant 4 network hops to complete a client operation with sync backup. Since the sync backup configuration is our out-of-the-box experience, the boomerang backups optimization was introduced. The following diagram illustrates the change in terms of client-to-cluster communication.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Faxs5ewrji6000iy39q7q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Faxs5ewrji6000iy39q7q.png" alt="Client backup acknowledgments flow" width="800" height="465"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As may be seen above, boomerang backups decrease the number of network hops to 3. With this change, we saw up to a 30% throughput improvement in our tests. This optimization was shipped in client v4.0.&lt;/p&gt;

&lt;h2&gt;Migration to Native Promises&lt;/h2&gt;

&lt;p&gt;Everyone knows that callbacks lost the battle and that most Node.js applications are now written with promises. That’s why the Hazelcast Node.js client has had a Promise-based API from day one. Older versions used the &lt;a href="https://www.npmjs.com/package/bluebird" rel="noopener noreferrer"&gt;bluebird&lt;/a&gt; Promise library for performance reasons. But since then, V8’s native Promise implementation has gotten &lt;a href="https://v8.dev/blog/fast-async" rel="noopener noreferrer"&gt;much faster&lt;/a&gt;, and we decided to give native promises a try.&lt;/p&gt;

&lt;p&gt;Benchmark measurements &lt;a href="https://github.com/hazelcast/hazelcast-nodejs-client/issues/559#issuecomment-694012526" rel="noopener noreferrer"&gt;showed&lt;/a&gt; no performance regression after the migration, so the switch was shipped in v4.0. As a nice side effect of this change, we got an out-of-the-box integration with &lt;a href="https://nodejs.org/api/async_hooks.html" rel="noopener noreferrer"&gt;async_hooks&lt;/a&gt; module.&lt;/p&gt;

&lt;h2&gt;Other Optimizations&lt;/h2&gt;

&lt;p&gt;As expected, a bunch of smaller optimizations were made along the way. For instance, to reduce the amount of litter generated on the hot path, we switched from new Date() calls to Date.now(). Another example is the default serializer implementation for Buffer objects, which allows users to deal with Buffers instead of plain arrays of numbers. Not to mention that the internal code responsible for manipulating Buffers also improved a lot. It’s hard to notice the effect of each individual optimization here, but they’re certainly worth it.&lt;/p&gt;
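&lt;p&gt;The Date.now() switch in a nutshell: it returns a primitive number, while new Date() allocates a heap object that later has to be garbage-collected:&lt;/p&gt;

```javascript
// new Date() allocates a Date object just to read the current time...
const objectTimestamp = new Date().getTime();

// ...while Date.now() returns a primitive number with no allocation,
// producing no garbage on the hot path.
const primitiveTimestamp = Date.now();

// Both yield the same millisecond timestamp (within execution time).
console.log(typeof primitiveTimestamp); // 'number'
```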

&lt;h2&gt;A Self-Check&lt;/h2&gt;

&lt;p&gt;Before the wrap-up, let’s try to look at what we achieved in about one year. To do so, we’re going to run a couple of benchmarks for versions 0.10.0 (our baseline) and 4.0 (the latest one).&lt;/p&gt;

&lt;p&gt;For the sake of brevity, we’re going to compare IMap.set() and get() operations for 1KB ASCII values. Hopefully, this payload is close enough to what one may see on average in Node.js applications. Here is what the result looks like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fngv7k70gx8w8w9z1hqxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fngv7k70gx8w8w9z1hqxx.png" alt="v0.10.0 vs. v4.0 performance comparison" width="600" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above chart, we see an almost 3x throughput improvement in both operations. The value of all the implemented optimizations should be obvious now.&lt;/p&gt;

&lt;h2&gt;What’s Next?&lt;/h2&gt;

&lt;p&gt;There are multiple things we want to try in both the library and the tooling. For instance, we’re &lt;a href="https://github.com/hazelcast/hazelcast-nodejs-client/issues/652" rel="noopener noreferrer"&gt;experimenting&lt;/a&gt; with the onread option available in the net.Socket class. This option allows one to reuse a Buffer when reading from the socket. Unfortunately, the tls module, used by the client for encrypted communication, lacks the counterpart option, so we recently &lt;a href="https://github.com/nodejs/node/pull/35753" rel="noopener noreferrer"&gt;contributed&lt;/a&gt; to Node.js core to improve things.&lt;/p&gt;

&lt;p&gt;Our benchmarking approach also needs some improvements. First of all, we want to start measuring operation latency by collecting latency data into an &lt;a href="http://hdrhistogram.org/" rel="noopener noreferrer"&gt;HDR histogram&lt;/a&gt; throughout benchmark execution. Another nice addition would be integration with &lt;a href="https://github.com/hazelcast/hazelcast-simulator" rel="noopener noreferrer"&gt;Hazelcast Simulator&lt;/a&gt;, our distributed benchmarking framework. Finally, support for more data structures and payload types won’t hurt.&lt;/p&gt;

&lt;h2&gt;Lessons Learned&lt;/h2&gt;

&lt;p&gt;Yes, we know that the “high-performance library” title may sound too loud, but we do our best to deserve it. For us, as open-source library maintainers, performance analysis is a process that requires constant attention. Necessary routine actions, like pre-release performance analysis, may be tiring. We had to throw many (if not most) of our experiments into the trash can. But in the end, performance is something we aim to deliver in all of our &lt;a href="https://hazelcast.org/imdg/clients-languages/" rel="noopener noreferrer"&gt;client libraries&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>hazelcast</category>
      <category>node</category>
      <category>performance</category>
    </item>
    <item>
      <title>Hazelcast Node.js Client 4.0 is Released</title>
      <dc:creator>Andrei Pechkurov</dc:creator>
      <pubDate>Fri, 02 Oct 2020 07:10:20 +0000</pubDate>
      <link>https://dev.to/hazelcast/hazelcast-node-js-client-4-0-is-released-4hge</link>
      <guid>https://dev.to/hazelcast/hazelcast-node-js-client-4-0-is-released-4hge</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhazelcast.com%2Fwp-content%2Fuploads%2F2020%2F10%2Fnode-400x253.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fhazelcast.com%2Fwp-content%2Fuploads%2F2020%2F10%2Fnode-400x253.png" title="Node.js logo" alt="alt text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hazelcast Node.js client 4.0 is now available! Let’s take a look at the main changes in this new release.&lt;/p&gt;

&lt;h2&gt;Hazelcast Client Protocol 2.0&lt;/h2&gt;

&lt;p&gt;The Node.js client now uses Hazelcast Open Binary Client Protocol 2.0, which has a number of enhancements and serialization improvements compared with 1.x. For the end user, this means that the client now supports IMDG 4.0+. Also, note that you cannot use a 4.0 client with IMDG 3.x members.&lt;/p&gt;

&lt;h2&gt;Ownerless Client&lt;/h2&gt;

&lt;p&gt;In Hazelcast 3.x, clients were implicitly assigned to an owner member responsible for cleaning up their resources after they left the cluster. Ownership information had to be replicated to the whole cluster whenever a client joined it. The “owner member” concept is now removed, and Node.js client 4.0 acts as an ownerless client, which is a simpler solution that removes the extra replication step.&lt;/p&gt;

&lt;h2&gt;Configuration Redesign and API Cleanup&lt;/h2&gt;

&lt;p&gt;Programmatic configuration in client 4.0 has become simpler and does not require boilerplate code anymore. The configuration itself is now represented with a plain JavaScript object.&lt;/p&gt;

&lt;p&gt;Programmatic configuration (old way):&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;Config&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hazelcast-client&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Create a configuration object&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;clientConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClientConfig&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// Customize the client configuration&lt;/span&gt;
&lt;span class="nx"&gt;clientConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;clusterName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cluster-name&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="nx"&gt;clientConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;networkConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;addresses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;10.90.0.2:5701&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;clientConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;networkConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;addresses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;10.90.0.3:5701&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;clientConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;listeners&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addLifecycleListener&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;function &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Lifecycle Event &amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize the client with the given configuration&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newHazelcastClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;clientConfig&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Programmatic configuration (new way):&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// No need to require Config anymore&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;Client&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hazelcast-client&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Initialize the client with the configuration object (POJO)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;newHazelcastClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;clusterName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cluster-name&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;network&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;clusterMembers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;10.90.0.2:5701&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;10.90.0.3:5701&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;lifecycleListeners&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Lifecycle Event &amp;gt;&amp;gt;&amp;gt; &lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The “shape” of the configuration is kept close to the old declarative configuration API and to the Java client’s YAML/XML configuration. So, the user experience is consistent across Hazelcast clients, while also remaining native to JavaScript and the Node.js runtime.&lt;/p&gt;

&lt;p&gt;The old declarative configuration API was removed, as it no longer makes much sense considering these changes.&lt;/p&gt;

&lt;p&gt;The 4.0 release also brings a number of changes aimed at making the API more idiomatic for JavaScript and more familiar to Node.js developers.&lt;/p&gt;

&lt;h2&gt;CP Subsystem Support&lt;/h2&gt;

&lt;p&gt;In Hazelcast 4.0, concurrency primitives moved to the CP Subsystem. The CP Subsystem contains new implementations of Hazelcast’s concurrency APIs on top of the Raft consensus algorithm. As the name of the module implies, these implementations are CP with respect to the CAP theorem, and they live alongside the AP data structures in the same Hazelcast IMDG cluster. They maintain linearizability in all cases, including client and server failures and network partitions, and prevent split-brain situations.&lt;/p&gt;

&lt;p&gt;Node.js client 4.0 supports all data structures available in the CP Subsystem, such as AtomicLong, AtomicReference, FencedLock, Semaphore, and CountDownLatch. Here is what basic FencedLock usage looks like:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Get a FencedLock called 'my-lock'&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;lock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getCPSubsystem&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;getLock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my-lock&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Acquire the lock (returns a fencing token)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Your guarded code goes here&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Make sure to release the lock&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;lock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unlock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fence&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Backup Acknowledgments&lt;/h2&gt;

&lt;p&gt;In previous versions, the client waited for sync backups to complete on the member, which required 4 network hops to complete a client operation with sync backup. Since sync backups are the out-of-the-box configuration, we wanted to improve their performance. The backup acknowledgments (a.k.a. boomerang backups) design decreases the number of network hops to 3, thus improving throughput by up to 30%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw00xargzwjtyach6tusq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw00xargzwjtyach6tusq.png" title="Backup Acknowledgements Diagram" alt="alt text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Improved Performance&lt;/h2&gt;

&lt;p&gt;We ran a number of experiments and made optimizations that improved write performance by 5-10%.&lt;/p&gt;

&lt;h2&gt;Other Changes&lt;/h2&gt;

&lt;p&gt;You can see the list of all changes in this version in the &lt;a href="https://github.com/hazelcast/hazelcast-nodejs-client/releases/tag/v4.0.0" rel="noopener noreferrer"&gt;release notes&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;What’s Next?&lt;/h2&gt;

&lt;p&gt;We believe the Node.js client has the capabilities to cover most of your use cases. Next, we are planning to work on integrations with well-known Node.js libraries! Here are the top items in our backlog:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hazelcast session store for popular Node.js web frameworks: A session store backed by Hazelcast IMDG.&lt;/li&gt;
&lt;li&gt;Hazelcast cache adapters for popular ORMs: Hazelcast integration with the Sequelize framework, a promise-based Node.js ORM for SQL databases.&lt;/li&gt;
&lt;li&gt;Blue/Green Deployments: Ability to divert the client automatically to another cluster on demand or when the intended cluster becomes unavailable.&lt;/li&gt;
&lt;li&gt;Full SQL support: Once the SQL feature in Hazelcast graduates from beta status, we are going to add it to the Node.js client.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can always check the &lt;a href="https://hazelcast.org/imdg/clients-languages/node-js/#roadmap" rel="noopener noreferrer"&gt;Hazelcast Node.js client roadmap&lt;/a&gt; for an up-to-date list of features in our backlog.&lt;/p&gt;

&lt;p&gt;Hazelcast Node.js client 4.0 is available on &lt;a href="https://www.npmjs.com/package/hazelcast-client" rel="noopener noreferrer"&gt;npm&lt;/a&gt;. We look forward to hearing your feedback on our &lt;a href="https://slack.hazelcast.com/" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;, &lt;a href="https://stackoverflow.com/questions/tagged/hazelcast" rel="noopener noreferrer"&gt;Stack Overflow&lt;/a&gt;, or &lt;a href="https://groups.google.com/g/hazelcast" rel="noopener noreferrer"&gt;Google groups&lt;/a&gt;. If you would like to suggest some changes or contribute, please visit our &lt;a href="https://github.com/hazelcast/hazelcast-nodejs-client" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; repository.&lt;/p&gt;

</description>
      <category>hazelcast</category>
      <category>node</category>
    </item>
    <item>
      <title>Storing Time Series in RocksDB: A Cookbook</title>
      <dc:creator>Andrei Pechkurov</dc:creator>
      <pubDate>Tue, 22 Sep 2020 14:56:12 +0000</pubDate>
      <link>https://dev.to/puzpuzpuz/storing-time-series-in-rocksdb-a-cookbook-273n</link>
      <guid>https://dev.to/puzpuzpuz/storing-time-series-in-rocksdb-a-cookbook-273n</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--u8xCayz_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AFDpLNY6uLZTfCGbc" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--u8xCayz_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AFDpLNY6uLZTfCGbc" alt="" width="800" height="533"&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@aronvisuals?utm_source=medium&amp;amp;utm_medium=referral"&gt;Aron Visuals&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Common wisdom says that key/value (K/V) stores are a bad fit for time series (TS) data. The reasons are the high write rate and large data volume implied by time series. But common wisdom is sometimes wrong. Today we will discuss an approach to building a (relatively) efficient TS storage on top of &lt;a href="https://rocksdb.org/"&gt;RocksDB&lt;/a&gt;, an embedded K/V store from Facebook. RocksDB is a perfect fit for our needs, as it’s production-ready, well-maintained, and provides solid write speed thanks to its &lt;a href="https://en.wikipedia.org/wiki/Log-structured_merge-tree"&gt;LSM tree&lt;/a&gt; data structure. A TS storage built with this approach should be capable of good read/write throughput, as well as a decent data compression ratio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;. Before we go any further, I need to say that in situations when you can use an existing TS database, you certainly should go for it. But sometimes you might need an embedded TS storage and then the list of options is almost empty.&lt;/p&gt;

&lt;p&gt;This blog post is a short, napkin-style summary of the approach, which is described in greater detail in this &lt;a href="https://youtu.be/1fzae--iHYU"&gt;talk&lt;/a&gt; I gave some time ago. While that talk focused on the needs of Hazelcast Management Center and included some code snippets in Java, I’m going to keep this post as short and language-agnostic as possible.&lt;/p&gt;

&lt;p&gt;My ultimate goal is to shed enough light on the idea so that any developer can build an embedded TS storage on top of RocksDB (or any other K/V store) in any popular programming language.&lt;/p&gt;

&lt;p&gt;Let’s start with the terms and assumptions necessary for further explanation. For the sake of readability, I’m going to use TypeScript for the code snippets.&lt;/p&gt;

&lt;h3&gt;Terminology&lt;/h3&gt;

&lt;p&gt;“Metric” — a numerical value that can be measured at a particular time and has real-world meaning. Examples: CPU load, used heap memory. Uniquely characterized by a name and set of tags.&lt;/p&gt;

&lt;p&gt;“Data point” (aka “sample”) — a metric value measured at the given time. Characterized by a metric, timestamp (Unix time), and value.&lt;/p&gt;

&lt;p&gt;“Time series” — a series of data points that belong to the same metric and have monotonically increasing timestamps.&lt;/p&gt;

&lt;p&gt;A data point could be described by the following interface.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
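&lt;p&gt;The original snippet was embedded as a gist that is not available here; a reconstruction consistent with the definitions above might look as follows (the exact field names are my assumption):&lt;/p&gt;

```typescript
// A metric is uniquely identified by its name and a set of tags.
interface Metric {
  name: string;
  tags: { [key: string]: string };
}

// A single measurement: the metric it belongs to, a Unix timestamp
// (seconds), and an integer value.
interface DataPoint {
  metric: Metric;
  timestamp: number;
  value: number;
}

// Example: used heap memory of a cluster member at a point in time.
const point: DataPoint = {
  metric: { name: 'usedHeapMemory', tags: { member: '10.0.0.1:5701' } },
  timestamp: 1600000000,
  value: 42,
};
```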


&lt;p&gt;Here is what a real-world data point can look like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VTnvlvw8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2ArlPhv-r4Xc568YYrqmGbmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VTnvlvw8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2ArlPhv-r4Xc568YYrqmGbmg.png" alt="" width="800" height="92"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The storage itself can be expressed with the following interface.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
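&lt;p&gt;Again, the original gist is missing from this feed; based on the description of the public API, the storage interface could be sketched like this (the method names and the synchronous signatures are assumptions made for brevity):&lt;/p&gt;

```typescript
// Repeated from the previous snippet for self-containment.
interface Metric {
  name: string;
  tags: { [key: string]: string };
}

interface DataPoint {
  metric: Metric;
  timestamp: number;
  value: number;
}

// The whole public API of the storage: batched writes plus queries
// over a single time series (i.e. a single metric).
interface TimeSeriesStorage {
  // Persist a batch of data points, possibly for different metrics.
  store(points: DataPoint[]): void;
  // Return the data points of the given metric whose timestamps fall
  // into the [fromTimestamp, toTimestamp] range, in ascending order.
  query(metric: Metric, fromTimestamp: number, toTimestamp: number): DataPoint[];
}
```

&lt;p&gt;In a real implementation both methods would likely be asynchronous, since they may touch the disk.&lt;/p&gt;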


&lt;p&gt;As you can see, the public API is very simple and provides methods for storing a list of data points and for querying a single time series (i.e. a single metric). That is why the term “storage” is used here: it’s not a TS database, as it lacks a query language, an execution engine powered by metric metadata indexes, an aggregation API, and some other modules. On the other hand, many of those features can be built on top of the storage, but that goes beyond today’s topic.&lt;/p&gt;

&lt;h3&gt;Assumptions&lt;/h3&gt;

&lt;p&gt;All TS databases and storages make assumptions about the shape of the time series data, and so do we.&lt;/p&gt;

&lt;p&gt;Firstly, we assume that timestamps have one-second granularity, i.e. each timestamp corresponds to the start of a second. As we will see later, this assumption opens up opportunities for bitmap-based data compression.&lt;/p&gt;

&lt;p&gt;Secondly, data point values have to be integer numbers. This assumption also allows us to use certain compression techniques, but it is not a must-have: it’s possible to support floating-point numbers by using a different compression algorithm.&lt;/p&gt;

&lt;p&gt;Now that we have a better understanding of our needs, let’s discuss the K/V data layout.&lt;/p&gt;

&lt;h3&gt;Key-value layout&lt;/h3&gt;

&lt;p&gt;The main idea here is very natural for time series and may be found in most (if not all) TS DBs. In order to reduce the number of K/V entries and enable ways to compress data, we store time series in minute buckets. The choice of a minute as the time interval is a compromise between durability (a smaller interval means more frequent on-disk persistence) and overall storage efficiency (more data points in a bucket means better compression and read/write throughput).&lt;/p&gt;

&lt;p&gt;As the second step, we also extract metrics into a separate K/V store (RocksDB calls them “databases”) and assign them integer identifiers. This way, each data point record in the main store contains an identifier instead of a lengthy metric description.&lt;/p&gt;
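&lt;p&gt;A toy version of such a registry (in-memory only, without the RocksDB persistence) may help to see the idea; the key format below is an assumption, not the actual implementation:&lt;/p&gt;

```typescript
// Maps a serialized metric description to a compact integer id,
// mimicking the metrics registry store.
const metricIds = new Map();
let nextMetricId = 0;

function metricKey(name: string, tags: { [key: string]: string }): string {
  // Sort tag names so that logically equal metrics get equal keys.
  const tagPart = Object.keys(tags)
    .sort()
    .map((k) => k + '=' + tags[k])
    .join(',');
  return name + '|' + tagPart;
}

function idFor(name: string, tags: { [key: string]: string }): number {
  const key = metricKey(name, tags);
  if (!metricIds.has(key)) {
    metricIds.set(key, nextMetricId);
    nextMetricId += 1;
  }
  return metricIds.get(key);
}

// The same metric always resolves to the same compact id.
console.log(idFor('cpuLoad', { member: 'a' }), idFor('cpuLoad', { member: 'b' }));
```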

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---AuKL73c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2A6GQ61IC1T4vu7RchKKp-VQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---AuKL73c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2A6GQ61IC1T4vu7RchKKp-VQ.png" alt="" width="800" height="163"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The layout of the main K/V store which holds minute buckets looks like the following.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R-g5IpvL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2ArEW2StypawyJ5hV7ZWA46w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R-g5IpvL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2ArEW2StypawyJ5hV7ZWA46w.png" alt="" width="800" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here the value holds a byte buffer (a blob) with the compressed minute bucket data.&lt;/p&gt;

&lt;p&gt;As you might have guessed already, the next topic is data compression.&lt;/p&gt;

&lt;h3&gt;Data point compression&lt;/h3&gt;

&lt;p&gt;In our design, each minute bucket may hold up to 60 data points. In reality, some of them may be missing, or the data collection interval may be longer than one second. The most straightforward approach is to store the whole bucket in a number array of 60 elements and, maybe, compress it with one algorithm or another. But we should try to avoid storing “holes”, so we need something smarter.&lt;/p&gt;

&lt;p&gt;As we already store the timestamps of minute starts in keys, for each data point we only need to store its offset from the corresponding minute start. And here comes bitmap encoding, the first compression technique we shall use: we only need 64 bits to encode all 60 possible offsets. Let’s call this part of the blob the “data point layout”.&lt;/p&gt;
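&lt;p&gt;A sketch of this bitmap encoding might look as follows. The BigInt arithmetic below (adding and dividing by powers of two) is equivalent to setting and testing individual bits:&lt;/p&gt;

```typescript
// A 64-bit bitmap where bit N marks a data point at second N of the
// minute. Powers-of-two arithmetic on BigInt stands in for bitwise
// shift-and-mask operations here.

function setOffset(bitmap: bigint, offset: number): bigint {
  return hasOffset(bitmap, offset) ? bitmap : bitmap + 2n ** BigInt(offset);
}

function hasOffset(bitmap: bigint, offset: number): boolean {
  // BigInt division truncates, so this extracts the bit at `offset`.
  return (bitmap / 2n ** BigInt(offset)) % 2n === 1n;
}

// Encode data points observed at seconds 0, 5, and 59 of the minute.
let bitmap = 0n;
[0, 5, 59].forEach((offset) => {
  bitmap = setOffset(bitmap, offset);
});

console.log(hasOffset(bitmap, 5), hasOffset(bitmap, 6));
```

&lt;p&gt;A production implementation would, of course, use bitwise operators on a 64-bit integer; the arithmetic form is just easier to read in a snippet.&lt;/p&gt;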

&lt;p&gt;As for the values (the “payload”), we can put them in an array and compress it with a simple algorithm that combines delta encoding and run-length encoding. We are going to leave all details of the exact algorithm out of scope, but if you want to learn more, this &lt;a href="https://blog.timescale.com/blog/time-series-compression-algorithms-explained/"&gt;blog post&lt;/a&gt; would be a good starting point.&lt;/p&gt;
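&lt;p&gt;To give a taste of the payload compression, here is a naive combination of delta encoding and run-length encoding; a production implementation would operate on bits rather than on number arrays:&lt;/p&gt;

```typescript
// Delta encoding: store the first value, then differences between
// neighbors. Monotonic counters turn into runs of small numbers.
function deltaEncode(values: number[]): number[] {
  return values.map((v, i) => (i === 0 ? v : v - values[i - 1]));
}

// Run-length encoding: flat pairs of (value, repeat count).
function rleEncode(values: number[]): number[] {
  const out: number[] = [];
  values.forEach((v) => {
    if (out.length === 0 || out[out.length - 2] !== v) {
      out.push(v, 1);
    } else {
      out[out.length - 1] += 1;
    }
  });
  return out;
}

// A counter growing by 10 each second compresses really well:
// the value 100 once, then a single run of five 10s.
const samples = [100, 110, 120, 130, 140, 150];
console.log(rleEncode(deltaEncode(samples)));
```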

&lt;p&gt;This picture illustrates the layout of each minute bucket.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wGUkNKyB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2A_24OQlwlkwh93EzryRFxRg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wGUkNKyB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2A_24OQlwlkwh93EzryRFxRg.png" alt="" width="800" height="180"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is time to discuss how these building blocks work together as a whole.&lt;/p&gt;

&lt;h3&gt;TS storage design&lt;/h3&gt;

&lt;p&gt;Let’s look at what the implementation may look like from a high-level design perspective.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nE_Uf2Wd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AVY7uSpz4TcT12Yg-RSkXAg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nE_Uf2Wd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AVY7uSpz4TcT12Yg-RSkXAg.png" alt="" width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, Metrics Registry stands for the K/V store containing metric-to-identifier pairs, Persistent Store holds the minute buckets, and In-Memory Store serves as a write-behind cache.&lt;/p&gt;

&lt;p&gt;Each write goes through the following flow.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9GSZCVKI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AZGjotlgXyHVx_GjfExsuYQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9GSZCVKI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AZGjotlgXyHVx_GjfExsuYQ.png" alt="" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On each store() call, data points are accumulated in minute buckets kept in the cache, while on-disk persistence happens periodically in a background job. Each run of the job iterates over all cached buckets and persists those that have been accumulating for at least one minute. As the last piece of the puzzle, compression is applied before persisting buckets.&lt;/p&gt;

&lt;p&gt;This way we group data points into buckets and benefit from both data compression and batch writing.&lt;/p&gt;

&lt;p&gt;As shown below, the time series read flow is much simpler.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dDVrfRE3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AO8edwRIRAdwfOKOB_Y76Cg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dDVrfRE3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AO8edwRIRAdwfOKOB_Y76Cg.png" alt="" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When reading data points, we first check the Metrics Registry to find the identifier (hint: an in-memory cache will certainly speed up this step). Then we check the In-Memory Store trying to find the bucket, and only then do we read it from disk (from RocksDB, to be more precise). This way we make sure that queries for recent data points are fast, which is valuable for many use cases, like showing the latest data in a UI.&lt;/p&gt;

&lt;p&gt;That is basically the whole approach. Congrats on your new knowledge!&lt;/p&gt;

&lt;p&gt;While the theory is valuable, no doubt, some of you might want to know the characteristics of a concrete implementation of the TS storage.&lt;/p&gt;

&lt;h3&gt;Potential characteristics&lt;/h3&gt;

&lt;p&gt;Some time ago, we developed an embedded TS storage for Management Center, a cluster management &amp;amp; monitoring application for &lt;a href="https://hazelcast.org/imdg/"&gt;Hazelcast IMDG&lt;/a&gt;. The storage was written in Java on top of RocksDB and followed the described design.&lt;/p&gt;

&lt;p&gt;Along with other types of testing, we benchmarked the storage in order to understand its performance characteristics. The benchmark emulated 10 cluster nodes reporting 120K metrics every 3 seconds. Random integer numbers from the 0–1,000 range were used as data point values. As a result, on decent hardware with a SATA III connected SSD, the storage showed a throughput of 400K data point writes per second and 19K minute series (physical) reads per second. These numbers can be called good enough, at least for our needs.&lt;/p&gt;

&lt;p&gt;As for compression efficiency, on average the storage spends around 5.25 bytes per data point. Raw data point data takes 16 bytes, so the compression ratio is around 3x. This result is worse than what standalone TS DBs usually &lt;a href="https://prometheus.io/docs/prometheus/1.8/storage/#chunk-encoding"&gt;show&lt;/a&gt;, but it is not terrible and may be sufficient for many use cases, especially considering how simple the storage is.&lt;/p&gt;

&lt;h3&gt;Summary&lt;/h3&gt;

&lt;p&gt;I’m intentionally leaving additional challenges, like potential enhancements and technical restrictions, out of the scope of this blog post, but you may find them mentioned in the talk. Also, the design is not carved in stone and leaves a lot of room for variations.&lt;/p&gt;

&lt;p&gt;My main intent is to show that a simple, yet good enough TS storage can easily be built on top of RocksDB or any other suitable K/V store. So, if you plan to write one, I wish you good luck and fun coding!&lt;/p&gt;

</description>
      <category>timeseriesdatabase</category>
      <category>rocksdb</category>
    </item>
    <item>
      <title>[V8 Deep Dives] Understanding Map Internals</title>
      <dc:creator>Andrei Pechkurov</dc:creator>
      <pubDate>Thu, 27 Aug 2020 13:56:33 +0000</pubDate>
      <link>https://dev.to/puzpuzpuz/v8-deep-dives-understanding-map-internals-fc6</link>
      <guid>https://dev.to/puzpuzpuz/v8-deep-dives-understanding-map-internals-fc6</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QjhcYPEh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2At535eVa3f0fBwYfz" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QjhcYPEh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2At535eVa3f0fBwYfz" alt="" width="800" height="600"&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@julianpaul?utm_source=medium&amp;amp;utm_medium=referral"&gt;Julian Paul&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this blog post, I am starting the V8 Deep Dives series, dedicated to my experiments and findings in V8, which is, no doubt, a well-engineered and sophisticated piece of software. Hopefully, you will find this blog post valuable and share your ideas for the next topic.&lt;/p&gt;

&lt;h3&gt;
  
  
  Intro
&lt;/h3&gt;

&lt;p&gt;&lt;a href="http://www.ecma-international.org/ecma-262/6.0/"&gt;ECMAScript 2015&lt;/a&gt;, also known as ES6, introduced many built-in collections, such as Map, Set, WeakMap, and WeakSet. They appeared to be an excellent addition to the standard JS library and got widely adopted in libraries, applications, and Node.js core. Today we are going to focus on Map collection and try to understand V8 implementation details, as well as make some practical conclusions.&lt;/p&gt;

&lt;p&gt;The spec does not dictate a precise algorithm used to implement Map support, but instead gives some hints for possible implementations and expected performance characteristics:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Map object must be implemented using either hash tables or other mechanisms that, on average, provide access times that are sublinear on the number of elements in the collection. The data structures used in this Map objects specification is only intended to describe the required observable semantics of Map objects. It is not intended to be a viable implementation model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As we see here, the spec leaves a lot of room for each implementer, i.e., each JS engine, but does not give much certainty about the exact algorithm, its performance, or the memory footprint of the implementation. If your application deals with Maps on its hot path, or if you store a lot of data in a Map, such details may certainly be of great help.&lt;/p&gt;

&lt;p&gt;As a developer with a Java background, I got used to Java collections, where one can choose between multiple implementations of the Map interface and even fine-tune them if the selected class supports that. Moreover, in Java it is always possible to open the source code of any class from the standard library and get familiar with the implementation (which, of course, may change across versions, but only in a more efficient direction). That is why I could not resist learning how Maps work in V8.&lt;/p&gt;

&lt;p&gt;Now, let’s start the dive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer.&lt;/strong&gt; What’s written below describes implementation details specific to V8 8.4 bundled with a recent dev version of Node.js (&lt;a href="https://github.com/nodejs/node/commit/238104c531219db05e3421521c305404ce0c0cce"&gt;commit 238104c&lt;/a&gt; to be more precise). You should not rely on any behavior beyond the spec, as it may change in any future version.&lt;/p&gt;

&lt;h3&gt;
  
  
  Underlying Algorithm
&lt;/h3&gt;

&lt;p&gt;First of all, Maps in V8 are built on top of hash tables. The subsequent text assumes that you understand how hash tables work. If you are not familiar with the concept, you should learn it first (e.g., by reading this &lt;a href="https://en.wikipedia.org/wiki/Hash_table"&gt;wiki page&lt;/a&gt;) and then return here.&lt;/p&gt;

&lt;p&gt;If you have substantial experience with Maps, you might have already noticed a contradiction here. Hash tables do not provide any order guarantees for iteration, while the ES6 spec requires implementations to keep the insertion order while iterating over a Map. So, the “classical” algorithm is not suitable for Maps. But it appears that it is still possible to use it with a slight variation.&lt;/p&gt;

&lt;p&gt;V8 uses the so-called &lt;a href="https://wiki.mozilla.org/User:Jorend/Deterministic_hash_tables"&gt;deterministic hash tables algorithm&lt;/a&gt; proposed by Tyler Close. The following pseudo-code shows the main data structures used by this algorithm:&lt;/p&gt;
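&lt;p&gt;A hedged sketch of those structures in plain JavaScript (the names CloseTable, hashTable, dataTable, and chain follow the description below; the -1 end-of-chain sentinel and the tail-walking set helper are illustrative assumptions):&lt;/p&gt;

```javascript
// hashTable[b] holds the index of bucket b's head entry in dataTable,
// dataTable stores entries in insertion order, and chain links entries
// that share a bucket; -1 marks the end of a chain (an assumption here).
function createCloseTable(numBuckets) {
  return {
    hashTable: new Array(numBuckets).fill(-1), // one head index per bucket
    dataTable: [],                             // entries in insertion order
    nextSlot: 0,                               // next free index in dataTable
  };
}

// Insert a new entry: store it under nextSlot and make it the new tail
// of the corresponding bucket's chain.
function set(table, key, value, hash) {
  const bucket = hash(key) % table.hashTable.length;
  const index = table.nextSlot;
  table.dataTable[index] = { key, value, chain: -1 };
  table.nextSlot += 1;
  let i = table.hashTable[bucket];
  if (i === -1) {
    table.hashTable[bucket] = index; // first entry in this bucket
  } else {
    while (table.dataTable[i].chain !== -1) i = table.dataTable[i].chain;
    table.dataTable[i].chain = index; // the inserted entry becomes the tail
  }
}

const table = createCloseTable(2);
const hash = (key) => key; // toy hash function for the example
set(table, 1, 'one', hash);
set(table, 2, 'two', hash);
set(table, 3, 'three', hash);
console.log(table.hashTable);                   // [ 1, 0 ]
console.log(table.dataTable.map((e) => e.key)); // [ 1, 2, 3 ]
```

&lt;p&gt;Note that the dataTable array alone already yields entries in insertion order, which is exactly the property the spec requires for iteration.&lt;/p&gt;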


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Here the CloseTable interface stands for the hash table. It contains the hashTable array, whose size is equal to the number of buckets. The Nth element of the array stands for the Nth bucket and holds the index of the bucket’s head element in the dataTable array. In turn, the dataTable array contains entries in insertion order. Finally, each Entry has a chain property, which points to the next entry in the bucket’s chain (a singly linked list, to be more precise).&lt;/p&gt;

&lt;p&gt;Each time a new entry is inserted into the table, it is stored in the dataTable array under the nextSlot index. This process also requires an update to the chain of the corresponding bucket, so that the inserted entry becomes the new tail.&lt;/p&gt;

&lt;p&gt;When an entry is deleted from the hash table, it is removed from the dataTable (e.g., by setting its key and value to undefined). As you might notice, this means that all deleted entries still occupy space in the dataTable.&lt;/p&gt;

&lt;p&gt;As the last piece of the puzzle, when a table gets full of entries (both present and deleted), it needs to be rehashed (rebuilt) with a bigger (or smaller) size.&lt;/p&gt;

&lt;p&gt;With this approach, iteration over a Map is just a matter of looping through the dataTable. That guarantees the insertion order requirement for iteration. Considering this, I expect most JS engines (if not all of them) to use deterministic hash tables as the building block behind Maps.&lt;/p&gt;
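&lt;p&gt;This guarantee is directly observable from JS. A quick sketch (re-inserting a deleted key appends it to the end of the iteration order, since it lands in a new dataTable slot):&lt;/p&gt;

```javascript
const m = new Map();
m.set('a', 1);
m.set('b', 2);
m.set('c', 3);

// Iteration follows insertion order, not key order or hash order.
console.log([...m.keys()].join(',')); // a,b,c

// Deleting and re-inserting a key moves it to the end of the order.
m.delete('b');
m.set('b', 4);
console.log([...m.keys()].join(',')); // a,c,b
```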

&lt;h3&gt;
  
  
  Algorithm in Practice
&lt;/h3&gt;

&lt;p&gt;Let’s go through more examples to see how the algorithm works. Say, we have a CloseTable with 2 buckets (hashTable.length) and a total capacity of 4 (dataTable.length), and the hash table is populated with four entries.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In this example, the internal table representation boils down to the two arrays: hashTable holding the head index for each of the two buckets, and dataTable holding the four entries in insertion order, linked through their chain properties.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;If we delete an entry by calling table.delete(1), its slot in the dataTable is cleared, but the slot itself keeps occupying space until the next rehashing.&lt;/p&gt;
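&lt;p&gt;A hedged reconstruction of these states (the keys 1–4, the string values, the toy hash(key) = key % 2 function, and the -1 end-of-chain sentinel are all illustrative assumptions):&lt;/p&gt;

```javascript
// State after table.set(1, 'a'), set(2, 'b'), set(3, 'c'), set(4, 'd'),
// assuming hash(key) = key % 2:
const tableFull = {
  hashTable: [1, 0], // bucket 0 starts at dataTable[1], bucket 1 at dataTable[0]
  dataTable: [
    { key: 1, value: 'a', chain: 2 },  // bucket 1 chain: 1 -> 3
    { key: 2, value: 'b', chain: 3 },  // bucket 0 chain: 2 -> 4
    { key: 3, value: 'c', chain: -1 },
    { key: 4, value: 'd', chain: -1 },
  ],
  nextSlot: 4, // the table is full, so the next insert triggers rehashing
};

// After table.delete(1) the slot is cleared but still occupies space:
const tableAfterDelete = {
  hashTable: [1, 0],
  dataTable: [
    { key: undefined, value: undefined, chain: 2 }, // a deleted "hole"
    { key: 2, value: 'b', chain: 3 },
    { key: 3, value: 'c', chain: -1 },
    { key: 4, value: 'd', chain: -1 },
  ],
  nextSlot: 4,
};

// A lookup walks the bucket's chain and simply skips deleted holes:
function get(table, key) {
  let i = table.hashTable[key % 2]; // toy hash
  while (i !== -1) {
    const entry = table.dataTable[i];
    if (entry.key === key) return entry.value;
    i = entry.chain;
  }
  return undefined;
}

console.log(get(tableFull, 3));        // c
console.log(get(tableAfterDelete, 1)); // undefined
```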


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;If we insert two more entries, the hash table will require rehashing. We will discuss this process in more detail a bit later.&lt;/p&gt;

&lt;p&gt;The same algorithm can be applied to Sets. The only difference is that Set entries do not need the value property.&lt;/p&gt;

&lt;p&gt;Now, when we have an understanding of the algorithm behind Maps in V8, we are ready to take a deeper dive.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation Details
&lt;/h3&gt;

&lt;p&gt;The Map implementation in V8 is written in C++ and then exposed to JS code. The main part of it is defined in OrderedHashTable and OrderedHashMap classes. We already learned how these classes work, but if you want to read the code yourself, you may find it &lt;a href="https://github.com/nodejs/node/blob/238104c531219db05e3421521c305404ce0c0cce/deps/v8/src/objects/ordered-hash-table.h"&gt;here&lt;/a&gt;, &lt;a href="https://github.com/nodejs/node/blob/238104c531219db05e3421521c305404ce0c0cce/deps/v8/src/objects/ordered-hash-table.cc"&gt;here&lt;/a&gt;, and, finally, &lt;a href="https://github.com/nodejs/node/blob/238104c531219db05e3421521c305404ce0c0cce/deps/v8/src/builtins/builtins-collections-gen.cc"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As we are focused on the practical details of V8’s Map implementation, we need to understand how table capacity is selected.&lt;/p&gt;

&lt;h3&gt;
  
  
  Capacity
&lt;/h3&gt;

&lt;p&gt;In V8, hash table (Map) capacity is always equal to a power of two. As for the load factor, it is a constant equal to 2, which means that the max capacity of a table is 2 * number_of_buckets. When you create an empty Map, its internal hash table has 2 buckets. Thus, the capacity of such a Map is 4 entries.&lt;/p&gt;

&lt;p&gt;There is also a limit for the max capacity. On a 64-bit system that number would be 2²⁷, which means that you can not store more than around 16.7M entries in a Map. This restriction comes from the on-heap representation used for Maps, but we will discuss this aspect a bit later.&lt;/p&gt;

&lt;p&gt;Finally, the grow/shrink factor used for rehashing is equal to 2. So, as soon as a Map gets 4 entries, the next insert will lead to a rehashing process where a new hash table twice as big (or, when shrinking, twice as small) will be built.&lt;/p&gt;

&lt;p&gt;To confirm what we have seen in the source code, I have modified V8 bundled in Node.js to expose the number of buckets as a custom buckets property available on Maps. You may find the result &lt;a href="https://github.com/puzpuzpuz/node/tree/experiment/expose-map-capacity"&gt;here&lt;/a&gt;. With this custom Node.js build, we can run a simple script that inserts 100 entries into an empty Map, printing the buckets value after each insert.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In the output of this script, the buckets value starts at 2 and doubles each time the Map’s capacity is exceeded.&lt;/p&gt;
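&lt;p&gt;Since the buckets property only exists in that custom build, here is a hedged stand-in that derives the same sequence purely from the rules described above (deleted entries are ignored for simplicity):&lt;/p&gt;

```javascript
// Derive the expected bucket count from the capacity rules: an empty Map
// starts with 2 buckets, capacity = 2 * number_of_buckets, and the table
// doubles once an insert exceeds the capacity.
function bucketsAfterInserts(entries) {
  let buckets = 2;
  for (let size = 1; size !== entries + 1; size += 1) {
    if (size > buckets * 2) {
      buckets *= 2; // rehashing to a twice as big table
    }
  }
  return buckets;
}

console.log(bucketsAfterInserts(100)); // 64 (capacity 128)
```

&lt;p&gt;For 100 inserts, the bucket count goes through 2, 4, 8, 16, 32, and 64, i.e. the doubling sequence described above.&lt;/p&gt;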


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As we see here, the Map grows by a power of two whenever its capacity is reached. So, our theory is now confirmed. Now, let’s try to shrink a Map by deleting all items from it with a second script, which prints the buckets value after each delete.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In this script’s output, the buckets value halves repeatedly as the entries are removed.&lt;/p&gt;
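&lt;p&gt;The same kind of hedged stand-in works for shrinking, assuming the rule stated below: the table is rebuilt at half the size once fewer entries remain than number_of_buckets / 2, with 2 buckets being the minimum:&lt;/p&gt;

```javascript
// Simulate deleting all entries one by one, halving the bucket count
// whenever occupancy drops below number_of_buckets / 2.
function bucketsAfterDeletes(startBuckets, entries) {
  let buckets = startBuckets;
  let size = entries;
  while (size > 0) {
    size -= 1; // map.delete(...)
    if (buckets > 2) {
      if (buckets / 2 > size) {
        buckets /= 2; // rehashing to a twice as small table
      }
    }
  }
  return buckets;
}

console.log(bucketsAfterDeletes(64, 100)); // 2
```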


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Again, we see that the Map shrinks as a power of two, once there are fewer remaining entries than number_of_buckets / 2.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hash Function
&lt;/h3&gt;

&lt;p&gt;So far, we have not discussed how V8 calculates hash codes for keys stored in Maps, and this is an interesting topic in its own right.&lt;/p&gt;

&lt;p&gt;For number-like values (Smis and heap numbers, BigInts and other similar internal stuff), it uses one or another well-known &lt;a href="https://github.com/nodejs/node/blob/238104c531219db05e3421521c305404ce0c0cce/deps/v8/src/utils/utils.h#L213"&gt;hash function&lt;/a&gt; with low collision probability.&lt;/p&gt;

&lt;p&gt;For string-like values (strings and symbols), it &lt;a href="https://github.com/nodejs/node/blob/238104c531219db05e3421521c305404ce0c0cce/deps/v8/src/objects/string.cc#L1338"&gt;calculates&lt;/a&gt; hash code based on the string contents and then caches it in the internal header.&lt;/p&gt;

&lt;p&gt;Finally, for objects, V8 &lt;a href="https://github.com/nodejs/node/blob/238104c531219db05e3421521c305404ce0c0cce/deps/v8/src/execution/isolate.cc#L3785"&gt;calculates&lt;/a&gt; the hash code based on a random number and then caches it in the internal header.&lt;/p&gt;

&lt;h3&gt;
  
  
  Time Complexity
&lt;/h3&gt;

&lt;p&gt;Most Map operations, like set or delete, require a lookup. Just like with the “classical” hash table, a lookup has O(1) time complexity on average.&lt;/p&gt;

&lt;p&gt;Let’s consider the worst-case when the table has N out of N entries (it is full), all entries belong to a single bucket, and the required entry is located at the tail. In such a scenario, a lookup requires N moves through the chain elements.&lt;/p&gt;

&lt;p&gt;On the other hand, in the best possible scenario when the table is full, but each bucket has 2 entries, a lookup will require up to 2 moves.&lt;/p&gt;

&lt;p&gt;It is a well-known fact that while individual operations in hash tables are “cheap”, rehashing is not. Rehashing has O(N) time complexity and requires allocation of the new hash table on the heap. Moreover, rehashing is performed as a part of insertion or deletion operations, when necessary. So, for instance, a map.set() call could be more expensive than you would expect. Luckily, rehashing is a relatively infrequent operation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Footprint
&lt;/h3&gt;

&lt;p&gt;Of course, the underlying hash table has to be somehow stored on the heap, in a so-called “backing store”. And here comes another interesting fact. The whole table (and thus, Map) is stored as a single array of fixed length. The array layout may be illustrated with the below diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nlno4J-E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AlrmWw6cULyCZ6sRmXnsZyQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nlno4J-E--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/1%2AlrmWw6cULyCZ6sRmXnsZyQ.png" alt="" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Specific fragments of the backing store array correspond to the header (contains necessary information, like bucket count or deleted entry count), buckets, and entries. Each entry of a bucket chain occupies three elements of the array: one for the key, one for the value, and one for the “pointer” to the next entry in the chain.&lt;/p&gt;

&lt;p&gt;As for the array size, we can roughly estimate it as N * 3.5, where N is the table capacity. To have an understanding of what it means in terms of memory footprint, let’s assume that we have a 64-bit system, and &lt;a href="https://v8.dev/blog/pointer-compression"&gt;pointer compression&lt;/a&gt; feature of V8 is disabled. In this setup, each array element requires 8 bytes, and a Map with the capacity of 2²⁰ (~1M) should take around 29 MB of heap memory.&lt;/p&gt;
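&lt;p&gt;The estimate can be reproduced with a quick calculation; the layout follows the description above, while the exact header size is an assumption:&lt;/p&gt;

```javascript
// Rough backing-store estimate: a header (small constant, assumed here),
// capacity / 2 bucket slots (load factor 2), and 3 slots per entry,
// i.e. about 3.5 slots per unit of capacity overall.
function backingStoreBytes(capacity, bytesPerSlot) {
  const headerSlots = 3;            // assumed small constant header
  const bucketSlots = capacity / 2; // load factor 2
  const entrySlots = capacity * 3;  // key, value, chain "pointer"
  return (headerSlots + bucketSlots + entrySlots) * bytesPerSlot;
}

// 64-bit system with pointer compression disabled: 8 bytes per slot.
const bytes = backingStoreBytes(2 ** 20, 8);
console.log(Math.round(bytes / 1e6) + ' MB'); // 29 MB
```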

&lt;h3&gt;
  
  
  Summary
&lt;/h3&gt;

&lt;p&gt;Gosh, that was a long journey. To wrap things up, here is a short list of what we have learned about Maps in V8:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;V8 uses the deterministic hash table algorithm to implement Maps, and it is very likely that other JS engines do so as well.&lt;/li&gt;
&lt;li&gt;Maps are implemented in C++ and exposed via JS API.&lt;/li&gt;
&lt;li&gt;Just like with “classical” hash maps, lookups required for Map operations are O(1) on average, and rehashing is O(N).&lt;/li&gt;
&lt;li&gt;On a 64-bit system, when pointer compression is disabled, a Map with 1M entries occupies ~29 MB on the heap.&lt;/li&gt;
&lt;li&gt;Most of the things described in this blog post can also be applied to Sets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s it for this time. Please share your ideas for the next V8 Deep Dive.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>v8engine</category>
      <category>node</category>
    </item>
    <item>
      <title>One Node.js CLS API to rule them all</title>
      <dc:creator>Andrei Pechkurov</dc:creator>
      <pubDate>Sat, 14 Mar 2020 19:35:44 +0000</pubDate>
      <link>https://dev.to/puzpuzpuz/one-nodejs-cls-api-to-rule-them-all-30h</link>
      <guid>https://dev.to/puzpuzpuz/one-nodejs-cls-api-to-rule-them-all-30h</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4v4kJmXn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Axrn_iq4qkWHcStGd" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4v4kJmXn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2Axrn_iq4qkWHcStGd" alt="" width="800" height="533"&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@jareeign?utm_source=medium&amp;amp;utm_medium=referral"&gt;Reign Abarintos&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Node.js v13.10.0 introduced a built-in CLS API, namely, the new &lt;a href="https://nodejs.org/api/async_hooks.html#async_hooks_class_asynclocalstorage"&gt;AsyncLocalStorage&lt;/a&gt; class located in the well-known experimental async_hooks module. In this short post I’ll try to explain why it is so important.&lt;/p&gt;

&lt;h3&gt;
  
  
  CLS API 101
&lt;/h3&gt;

&lt;p&gt;CLS stands for Continuation-Local Storage, an async variation of the Thread-Local Storage concept from the multithreaded world. CLS APIs allow you to associate and track contexts through asynchronous execution chains. Without such API, you have to either pass the context object explicitly (which sometimes is not an option), or deal with lots of monkey-patching for all async APIs in Node.js.&lt;/p&gt;

&lt;p&gt;CLS APIs are often used in Application Performance Monitoring (APM) tools for Node.js available in various cloud platforms. But there are situations when you may want to use CLS in your app directly or transitively, i.e. via your dependencies. Request id tracking is one such use case. You may read more on this topic in my &lt;a href="https://itnext.io/request-id-tracing-in-node-js-applications-c517c7dab62d"&gt;previous blog post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But let’s return to AsyncLocalStorage. To give you an impression of the API, here is an example that shows how to use AsyncLocalStorage to build a primitive logger with request id tracing capabilities.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  Why AsyncLocalStorage Matters
&lt;/h3&gt;

&lt;p&gt;Now that you have a better understanding of the subject, let’s go through a list of advantages of the new API.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;There are several Node.js user-land modules that implement CLS, the most popular among which are &lt;a href="https://github.com/othiym23/node-continuation-local-storage"&gt;continuation-local-storage&lt;/a&gt; and its successor &lt;a href="https://github.com/Jeff-Lewis/cls-hooked"&gt;cls-hooked&lt;/a&gt;. Unfortunately, neither of them is widely adopted by the community, and many Node.js libraries do not play well with these modules, leading to various context loss issues. Being a part of the standard library, AsyncLocalStorage has every chance of changing that, as library maintainers are much more likely to fix an issue related to a core API than to a third-party one.&lt;/li&gt;
&lt;li&gt;AsyncLocalStorage provides a simple, high-level API, which hides the complexity and low-level details of async hooks. Thus, it is more stable (in terms of compatibility) than APIs it builds upon and, potentially, may be moved into a separate, stable module.&lt;/li&gt;
&lt;li&gt;Performance-wise AsyncLocalStorage is significantly faster than its user-land competitors (see the next section for benchmark results). Mainly, that’s because it avoids destroy hook usage (thanks to &lt;a href="https://nodejs.org/api/async_hooks.html#async_hooks_async_hooks_executionasyncresource"&gt;executionAsyncResource&lt;/a&gt;).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Performance Comparison
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;Disclaimer. This comparison is intentionally kept as simple as possible, i.e. it only considers mean results of benchmark run series collected on a dev machine.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As you may already know, Node.js core includes a set of various benchmarks maintained by the core team. One of those benchmarks (namely, &lt;a href="https://github.com/nodejs/node/blob/ec204d86b06e4cc9259c1308e365a3e104212a16/benchmark/async_hooks/async-resource-vs-destroy.js"&gt;async-resource-vs-destroy.js&lt;/a&gt; with necessary modifications) was used to compare AsyncLocalStorage with alternatives. This benchmark aims to simulate a more or less standard web app. To do that, it starts an HTTP server which schedules a setTimeout and then reads a file (all in the same CLS context) when handling incoming requests.&lt;/p&gt;

&lt;p&gt;First of all, let’s see how AsyncLocalStorage compares with cls-hooked, one of the most popular user-land CLS APIs, in this scenario.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tgYed5l5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/600/1%2A2oygeKU6FkwLh86kAGf0UA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tgYed5l5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/600/1%2A2oygeKU6FkwLh86kAGf0UA.png" alt="" width="600" height="371"&gt;&lt;/a&gt;Benchmark against cls-hooked&lt;/p&gt;

&lt;p&gt;As we see in the above diagram, AsyncLocalStorage is significantly faster than cls-hooked with both plain callbacks and async/await syntax.&lt;/p&gt;

&lt;p&gt;Now, let’s try to measure the overhead of the new API when compared with the same web app with a no-op CLS implementation, i.e. an implementation that does nothing (and, thus, does not enable any async hooks).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cjc5MpWO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/600/1%2AkJ4E4J2eGv04LVl2fshV0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cjc5MpWO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/600/1%2AkJ4E4J2eGv04LVl2fshV0g.png" alt="" width="600" height="371"&gt;&lt;/a&gt;Benchmark against a no-op CLS&lt;/p&gt;

&lt;p&gt;As you may see, when your code is callback-based, the overhead is almost nothing. With async/await syntax it becomes more noticeable, but the penalty is not that high and there is &lt;a href="https://twitter.com/stephenbelanger/status/1234395447327805440?s=20"&gt;some space&lt;/a&gt; for future improvements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Call to Action
&lt;/h3&gt;

&lt;p&gt;At this point you might be curious about the maturity of AsyncLocalStorage and whether it is stable enough to be used in your app or library.&lt;/p&gt;

&lt;p&gt;I’d say that, at this point (March 2020), the API still needs some time to “settle down” (for instance, &lt;a href="https://github.com/nodejs/node/pull/31950"&gt;this PR&lt;/a&gt; involves noticeable changes), but early birds are more than welcome.&lt;/p&gt;

&lt;p&gt;Once AsyncLocalStorage becomes more stable API-wise, developers who deal with user-land CLS libraries, like cls-hooked, should definitely consider switching to the core module. Also, a backport to Node.js v12 is on its way, so that should help with broader adoption.&lt;/p&gt;

&lt;p&gt;As a Node.js community member, I’d like to thank all core collaborators who made AsyncLocalStorage possible (especially &lt;a href="https://medium.com/u/b662fb63374b"&gt;Vladimir de Turckheim&lt;/a&gt;, who put a lot of energy into the API). On the other hand, as a co-author of this core CLS API, I promise to do my best to spread the word and advocate for AsyncLocalStorage to make Node.js, as a platform and an ecosystem, a little bit better.&lt;/p&gt;

</description>
      <category>node</category>
    </item>
    <item>
      <title>An Intro to Node.js That You May Have Missed</title>
      <dc:creator>Andrei Pechkurov</dc:creator>
      <pubDate>Tue, 18 Dec 2018 17:01:00 +0000</pubDate>
      <link>https://dev.to/puzpuzpuz/an-intro-to-nodejs-that-you-may-have-missed-264</link>
      <guid>https://dev.to/puzpuzpuz/an-intro-to-nodejs-that-you-may-have-missed-264</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AwsWX5CujhkO7HoX4" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AwsWX5CujhkO7HoX4"&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@mrtwisty?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Zachary Young&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Everybody knows that Node.js is an open-source, cross-platform JavaScript runtime. Most Node.js developers know that it’s built on top of V8, a JS engine, and libuv, a multi-platform C library that provides support for asynchronous I/O based on event loops. But only a few developers can clearly explain how Node.js works internally and how it affects their code. That’s probably because many Node.js developers already know JavaScript before learning Node.js. So, they often start with Express.js, Sequelize, Mongoose, Socket.IO and other well-known libraries instead of investing their time in learning Node.js itself and its standard APIs. That seems like the wrong choice to me, as understanding the Node.js runtime and knowing the specifics of the built-in APIs helps to avoid many common mistakes.&lt;/p&gt;

&lt;p&gt;This post is an intro to Node.js in a compact, yet (hopefully) comprehensive manner. We’re going to take a general look at the Node.js architecture. As a result, we’ll try to determine some guidelines for writing higher-performance, more secure server-side web applications with Node.js. It should be helpful for Node.js beginners, as well as for experienced JS developers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Main Building Blocks
&lt;/h3&gt;

&lt;p&gt;Any Node.js application is built on top of the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://v8.dev/" rel="noopener noreferrer"&gt;V8&lt;/a&gt; — a Google’s open source high-performance JavaScript engine, written in C++. It is also used in Google Chrome browser and others. Node.js controls V8 via V8 C++ API.&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://libuv.org/" rel="noopener noreferrer"&gt;libuv&lt;/a&gt; — a multi-platform support library with a focus on asynchronous I/O, written in C. It was primarily developed for use by Node.js, but it’s also used by Luvit, Julia, pyuv, and others. Node.js uses libuv to abstract non-blocking I/O operations to a unified interface across all supported platforms. This library provides mechanisms to handle file system, DNS, network, child processes, pipes, signal handling, polling and streaming. It also includes a thread pool, also known as Worker Pool, for offloading work for some things that cannot be done asynchronously at the OS level.&lt;/li&gt;
&lt;li&gt;Other open-source, low-level components, mostly written in C/C++:
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://c-ares.haxx.se/" rel="noopener noreferrer"&gt;c-ares&lt;/a&gt; — a C library for asynchronous DNS requests, which is used for some DNS requests in Node.js.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/nodejs/http-parser" rel="noopener noreferrer"&gt;http-parser &lt;/a&gt;— a lightweight HTTP request/response parser library.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.openssl.org/" rel="noopener noreferrer"&gt;OpenSSL&lt;/a&gt; — a well-known general-purpose cryptography library. Used in tls and crypto modules.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://zlib.net/" rel="noopener noreferrer"&gt;zlib&lt;/a&gt; — a lossless data-compression library. Used in zlib module.&lt;/li&gt;
&lt;li&gt;The application — it’s your application’s code and standard Node.js modules, written in JavaScript.&lt;/li&gt;
&lt;li&gt;C/C++ bindings — wrappers around C/C++ libraries, built with N-API, a C API for building native Node.js addons, or other APIs for bindings.&lt;/li&gt;
&lt;li&gt;Some bundled tools that are used in Node.js infrastructure:
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.npmjs.com/" rel="noopener noreferrer"&gt;npm&lt;/a&gt; — a well-known package manager (and ecosystem).
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://gyp.gsrc.io/" rel="noopener noreferrer"&gt;gyp&lt;/a&gt; — a python-based project generator copied from V8. Used by node-gyp, a cross-platform command-line tool written in Node.js for compiling native addon modules.
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/google/googletest" rel="noopener noreferrer"&gt;gtest &lt;/a&gt;— Google’s C++ test framework. Used for testing native code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a simple diagram that shows main Node.js components that were mentioned in the list:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F862%2F1%2AfD6TsZEshqC2liv8o_GAdA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F862%2F1%2AfD6TsZEshqC2liv8o_GAdA.png"&gt;&lt;/a&gt;Main Node.js components&lt;/p&gt;

&lt;h3&gt;
  
  
  Node.js Runtime
&lt;/h3&gt;

&lt;p&gt;Here is a diagram that shows how Node.js runtime executes your JS code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F982%2F1%2ArWGbyCbcJTKI-m3ZEDhCaA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F982%2F1%2ArWGbyCbcJTKI-m3ZEDhCaA.png"&gt;&lt;/a&gt;Node.js runtime diagram (simplified)&lt;/p&gt;

&lt;p&gt;This diagram does not show all details that are happening in Node.js, but it highlights the most important parts. We’re going to briefly discuss them.&lt;/p&gt;

&lt;p&gt;Once your Node.js application starts, it first completes an initialization phase, i.e. runs the start script, including requiring modules and registering callbacks for events. Then the application enters the Event Loop (aka the main thread, event thread, etc.), which conceptually is built for responding to incoming client requests by executing the appropriate JS callback. JS callbacks are executed synchronously, and may use Node APIs to register asynchronous requests to continue processing after the callback completes. The callbacks for these asynchronous requests will also be executed on the Event Loop. Examples of such Node APIs include various timers (setTimeout(), setInterval(), etc.), functions from fs and http modules and many more. All of these APIs require a callback that will be triggered once the operation has finished.&lt;/p&gt;

&lt;p&gt;The Event Loop is a single-threaded and semi-infinite loop based on libuv. It’s called a semi-infinite loop because it quits at some point when there is no more work left to be done. From the developer’s perspective, that’s the point when your program exits.&lt;/p&gt;

&lt;p&gt;The Event Loop is pretty complex. It involves manipulating several event queues and includes the following phases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Timers phase — this phase executes callbacks scheduled by setTimeout() and setInterval().&lt;/li&gt;
&lt;li&gt;Pending callbacks phase — executes I/O callbacks deferred to the next loop iteration.&lt;/li&gt;
&lt;li&gt;Idle and prepare phases — internal phases.&lt;/li&gt;
&lt;li&gt;Poll phase — includes the following: retrieve new I/O events; execute I/O related callbacks (almost all with the exception of close, timers and setImmediate() callbacks); Node.js will block here when appropriate.&lt;/li&gt;
&lt;li&gt;Check phase — setImmediate() callbacks are invoked here.&lt;/li&gt;
&lt;li&gt;Close callbacks phase — some close callbacks are executed here, e.g. socket.on('close', ...).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;. Check the following &lt;a href="https://nodejs.org/en/docs/guides/event-loop-timers-and-nexttick/" rel="noopener noreferrer"&gt;guide&lt;/a&gt; to learn more about Event Loop phases.&lt;/p&gt;

&lt;p&gt;During the poll phase, the Event Loop fulfills non-blocking, asynchronous requests (started via Node APIs) by using libuv’s abstractions over OS-specific I/O polling mechanisms: epoll on Linux, IOCP on Windows, kqueue on BSD and macOS, and event ports on Solaris.&lt;/p&gt;

&lt;p&gt;It’s a common myth that Node.js is single-threaded. In essence, it’s true (or it used to be partially true, as there is experimental support for web workers, called Worker Threads), since your JS code always runs on a single thread, within the Event Loop. But you may also notice the Worker Pool, a fixed-size thread pool, on the diagram, so any Node.js process has multiple threads running in parallel. The reason is that not all Node API operations can be executed in a non-blocking fashion on all supported operating systems. Another reason for having the Worker Pool is that the Event Loop is not suited for CPU-intensive computations.&lt;/p&gt;

&lt;p&gt;So, Node.js (or libuv, in particular) does its best to keep the same asynchronous, event-driven API for such blocking operations and executes these operations on a separate thread pool. Here are some examples of such blocking operations in the built-in modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I/O-bound:

&lt;ul&gt;
&lt;li&gt;Some DNS operations in the dns module: dns.lookup(), dns.lookupService().&lt;/li&gt;
&lt;li&gt;Most file system operations provided by the fs module, like fs.readFile().&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;CPU-bound:

&lt;ul&gt;
&lt;li&gt;Some cryptographic operations provided by the crypto module, like crypto.pbkdf2(), crypto.randomBytes() or crypto.randomFill().&lt;/li&gt;
&lt;li&gt;Data compression operations provided by the zlib module.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice that some 3rd-party native libraries, like bcrypt, also offload computations to the worker thread pool.&lt;/p&gt;

&lt;p&gt;Now that you have a better understanding of Node.js’s overall architecture, let’s discuss some guidelines for writing higher-performance, more secure server-side applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule #1 — Avoid Mixing Sync and Async In Functions
&lt;/h3&gt;

&lt;p&gt;When you write any functions, you need to make them either completely synchronous or completely asynchronous. You should avoid mixing these approaches in a single function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;. If a function accepts a callback as an argument, that does not mean it’s asynchronous. As an example, think of the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/forEach" rel="noopener noreferrer"&gt;Array.forEach()&lt;/a&gt; function. This approach is often called &lt;a href="http://matt.might.net/articles/by-example-continuation-passing-style/" rel="noopener noreferrer"&gt;continuation-passing style&lt;/a&gt; (CPS).&lt;/p&gt;
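&lt;p&gt;A quick way to convince yourself (this snippet is illustrative, not from the original post):&lt;/p&gt;

```javascript
// Accepting a callback does not make a function asynchronous:
// Array.prototype.forEach() invokes its callback synchronously.
const order = []
const numbers = [1, 2, 3]

order.push('before forEach')
numbers.forEach(function (n) {
  order.push('callback for ' + n)
})
order.push('after forEach')

console.log(order.join(', '))
// → before forEach, callback for 1, callback for 2, callback for 3, after forEach
```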

&lt;p&gt;Let’s consider the following function as an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const fs = require('fs')

function checkFile (filename, callback) {
  if (!filename || !filename.trim()) {
    // pitfalls are here:
    return callback(new Error('Empty filename provided.'))
  }

  fs.open(filename, 'r', (err, fd) =&amp;gt; {
    if (err) return callback(err)

    // note: a real implementation should also close fd here
    callback(null, true)
  })
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function is quite simple, but it’s fine for our needs. The problem is the return callback(...) branch: the callback is invoked synchronously in case of an invalid argument, while for a valid input it is invoked in an async fashion, inside the fs.open() call.&lt;/p&gt;

&lt;p&gt;To show the potential issue with this code, let’s try to call it with different inputs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;checkFile('', () =&amp;gt; {
  console.log('#1 Internal: invalid input')
})
console.log('#1 External: invalid input')

checkFile('main.js', () =&amp;gt; {
  console.log('#2 Internal: existing file')
})
console.log('#2 External: existing file')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code will output the following to the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#1 Internal: invalid input
#1 External: invalid input
#2 External: existing file
#2 Internal: existing file
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You may have already noticed the problem here: the order of execution differs between the two cases. This makes the function non-deterministic, so such a style must be avoided. The function can easily be fixed into a completely async style by wrapping the return callback(...) call in setImmediate() or process.nextTick():&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if (!filename || !filename.trim()) {
  return setImmediate(
    () =&amp;gt; callback(new Error('Empty filename provided.'))
  )
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now our function behaves deterministically: the callback is always invoked asynchronously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule #2 — Don’t Block the Event Loop
&lt;/h3&gt;

&lt;p&gt;In server-side web applications, e.g. RESTful services, all requests are processed concurrently within the Event Loop’s single thread. So, for example, if processing an HTTP request in your application spends a significant amount of time executing a JS function that does a heavy calculation, it blocks the Event Loop for all other requests. As another example: if your application spends 10 milliseconds of JS processing on each HTTP request, the throughput of a single instance of the application will be about 1000 / 10 = 100 requests per second.&lt;/p&gt;

&lt;p&gt;Thus, the first golden rule of Node.js is “never block the Event Loop”. Here is a short list of recommendations that will help you to follow this rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid any heavy JS calculations. If you have code with time complexity worse than O(n), consider optimizing it or at least splitting the calculation into chunks that are recursively scheduled via a timer API, like setTimeout() or setImmediate(). This way you won’t block the Event Loop, and other callbacks will get a chance to run.&lt;/li&gt;
&lt;li&gt;Avoid any *Sync calls, like fs.readFileSync() or crypto.pbkdf2Sync(), in server applications. The only exception to this rule might be the startup phase of your application.&lt;/li&gt;
&lt;li&gt;Choose 3rd-party libraries wisely, as they might block the Event Loop, e.g. by running CPU-intensive computations written in JS.&lt;/li&gt;
&lt;/ul&gt;
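&lt;p&gt;To make the chunking recommendation from the first bullet concrete, here is a sketch (the function name and chunk size are made up) of a summation split into pieces with setImmediate():&lt;/p&gt;

```javascript
// A sketch (illustrative name and chunk size): split a long-running
// summation into chunks so the Event Loop can run other callbacks
// between them.
function sumChunked (arr, callback) {
  const chunkSize = 1000
  let total = 0
  let i = 0

  function processChunk () {
    const end = Math.min(i + chunkSize, arr.length)
    for (; i !== end; i++) {
      total += arr[i]
    }
    if (i === arr.length) {
      return callback(null, total)
    }
    // yield to the Event Loop before the next chunk
    setImmediate(processChunk)
  }

  processChunk()
}
```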

&lt;h3&gt;
  
  
  Rule #3 — Block the Worker Pool Wisely
&lt;/h3&gt;

&lt;p&gt;It may be surprising, but the Worker Pool may also be blocked. As we already know, it’s a fixed-size thread pool with a default size of 4 threads. The size may be increased by setting the &lt;a href="https://nodejs.org/api/cli.html#cli_uv_threadpool_size_size" rel="noopener noreferrer"&gt;UV_THREADPOOL_SIZE&lt;/a&gt; environment variable, but in many cases that won’t solve your problem.&lt;/p&gt;

&lt;p&gt;To illustrate the Worker Pool problem, let’s consider the following example. Your RESTful API has an authentication endpoint which calculates a hash value for the given password and matches it against the value obtained from a database. If you did everything right, the hashing is done on the Worker Pool. Let’s imagine that each computation takes about 100 milliseconds to finish. This means that with the default Worker Pool size you’ll get about 4 * (1000 / 100) = 40 requests per second in terms of the hashing endpoint’s throughput (an important note: we’re considering the case of 4+ CPU cores here). While all threads in the Worker Pool are busy, all incoming tasks, such as hash computations or fs calls, will be queued.&lt;/p&gt;

&lt;p&gt;So the second golden rule of Node.js is “block the Worker Pool wisely”. Here is a short list of recommendations that will help you to follow this rule:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid long-running tasks on the Worker Pool. As an example, prefer stream-based APIs over reading the whole file with fs.readFile().&lt;/li&gt;
&lt;li&gt;Consider partitioning CPU-intensive tasks if possible.&lt;/li&gt;
&lt;li&gt;Once again, choose 3rd-party libraries wisely.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Rule #0 — One Rule to Rule Them All
&lt;/h3&gt;

&lt;p&gt;Now, as a summary, we can formulate a rule of thumb for writing high-performance Node.js server-side applications. This rule of thumb is “Node.js is fast if the work done for each request at any given time is small enough”. This rule covers both Event Loop and Worker Pool.&lt;/p&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;p&gt;For further reading, I advise the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A guide from the Node.js team with more patterns that will help you avoid blocking the Event Loop and the Worker Pool: &lt;a href="https://nodejs.org/en/docs/guides/dont-block-the-event-loop/" rel="noopener noreferrer"&gt;https://nodejs.org/en/docs/guides/dont-block-the-event-loop/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;A brilliant series of articles for those who want to get a really deep understanding of how Node.js works internally: &lt;a href="https://blog.insiderattack.net/event-loop-and-the-big-picture-nodejs-event-loop-part-1-1cb67a182810" rel="noopener noreferrer"&gt;https://blog.insiderattack.net/event-loop-and-the-big-picture-nodejs-event-loop-part-1-1cb67a182810&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>node</category>
      <category>javascript</category>
    </item>
    <item>
      <title>A Pragmatic Overview of Async Hooks API in Node.js</title>
      <dc:creator>Andrei Pechkurov</dc:creator>
      <pubDate>Wed, 12 Dec 2018 13:08:37 +0000</pubDate>
      <link>https://dev.to/puzpuzpuz/a-pragmatic-overview-of-async-hooks-api-in-nodejs-2iol</link>
      <guid>https://dev.to/puzpuzpuz/a-pragmatic-overview-of-async-hooks-api-in-nodejs-2iol</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mp65lIyT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AWSM6gbwDbTgzJL-8" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mp65lIyT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AWSM6gbwDbTgzJL-8" alt="" width="800" height="615"&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@knipsiknips?utm_source=medium&amp;amp;utm_medium=referral"&gt;Tom Quandt&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Recently I wrote a &lt;a href="https://medium.com/@apechkurov/request-id-tracing-in-node-js-applications-c517c7dab62d"&gt;post&lt;/a&gt; about request id tracing in Node.js apps. The proposed solution was built around &lt;a href="https://github.com/Jeff-Lewis/cls-hooked"&gt;cls-hooked&lt;/a&gt; library, which in its turn uses node’s built-in Async Hooks API. So, I decided to get more familiar with async hooks. In this post I’m going to share my findings and describe some real-world use cases for this API.&lt;/p&gt;

&lt;p&gt;Let’s start our journey with a short intro.&lt;/p&gt;

&lt;h3&gt;
  
  
  An Intro to Async Hooks API
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://nodejs.org/api/async_hooks.html"&gt;Async Hooks&lt;/a&gt; is an experimental API available in Node.js starting from &lt;a href="https://nodejs.org/en/blog/release/v8.0.0/"&gt;v8.0.0&lt;/a&gt;. So, despite of being an experimental API, it exists for about a year and a half and seems to have no critical performance issues and other bugs. Experimental status means that the API may have non-backward compatible changes in future or may be even completely removed. But, considering that this API had a couple of not-so-lucky predecessors, &lt;a href="https://github.com/nodejs/node-v0.x-archive/pull/6011"&gt;process.addAsyncListener&lt;/a&gt; (&amp;lt;v0.12) and &lt;a href="https://github.com/nodejs/node-eps/pull/18"&gt;AsyncWrap&lt;/a&gt; (v6–7, unofficial), Async Hooks API is not the very first attempt and should eventually become a stable API.&lt;/p&gt;

&lt;p&gt;The documentation of the async_hooks module describes the purpose of this module in the following fashion:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The async_hooks module provides an API to register callbacks tracking the lifetime of asynchronous resources created inside a Node.js application.&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;An asynchronous resource represents an object with an associated callback. This callback may be called multiple times, for example, the 'connection' event in net.createServer(), or just a single time like in fs.open(). A resource can also be closed before the callback is called. AsyncHook does not explicitly distinguish between these different cases but will represent them as the abstract concept that is a resource.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, async hooks allow you to track almost any asynchronous activity happening in your node app. Events related to the registration and invocation of any callbacks and native promises in your code can potentially be listened to via an async hook. In other words, this API allows you to attach listeners to macrotask and microtask lifecycle events. Moreover, the API allows listening to &lt;a href="https://nodejs.org/api/async_hooks.html#async_hooks_type"&gt;low-level async resources&lt;/a&gt; from node’s built-in native modules, like fs and net.&lt;/p&gt;

&lt;p&gt;The core Async Hooks API can be expressed with the following snippet (a shortened version of &lt;a href="https://nodejs.org/api/async_hooks.html#async_hooks_overview"&gt;this snippet&lt;/a&gt; from the docs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const async_hooks = require('async_hooks')

// ID of the current execution context
const eid = async_hooks.executionAsyncId()
// ID of the handle responsible for triggering the callback of the
// current execution scope to call
const tid = async_hooks.triggerAsyncId()

const asyncHook = async_hooks.createHook({
  // called during object construction
  init: function (asyncId, type, triggerAsyncId, resource) { },
  // called just before resource's callback is called
  before: function (asyncId) { },
  // called just after resource's callback has finished
  after: function (asyncId) { },
  // called when an AsyncWrap instance is destroyed
  destroy: function (asyncId) { },
  // called only for promise resources, when the `resolve`
  // function passed to the `Promise` constructor is invoked
  promiseResolve: function (asyncId) { }
})

// starts listening for async events
asyncHook.enable()
// stops listening for new async events
asyncHook.disable()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that there are not so many functions in Async Hooks API and, in general, it looks quite simple.&lt;/p&gt;

&lt;p&gt;The executionAsyncId() function returns an identifier of the current execution context. The triggerAsyncId() function returns an id of the resource that was responsible for calling the callback that is currently being executed (let’s call it a parent or trigger id). The same id(s) are also available in async hook’s event listeners (see the createHook() function).&lt;/p&gt;

&lt;p&gt;You can use executionAsyncId() and triggerAsyncId() functions without creating and enabling an async hook. But, in this case, promise executions are &lt;a href="https://nodejs.org/api/async_hooks.html#async_hooks_promise_execution_tracking"&gt;not assigned&lt;/a&gt; async ids due to the relatively expensive nature of the promise introspection API in V8.&lt;/p&gt;

&lt;p&gt;Now, we’re going to focus on the behavior of async hooks, as it’s not so obvious how and when the callbacks in a created hook will be triggered. As the next step, we’re going to do some experiments with async hooks and learn how they work.&lt;/p&gt;

&lt;h3&gt;
  
  
  Let’s Play!
&lt;/h3&gt;

&lt;p&gt;Before doing any experiments, we’re going to implement a very primitive async hook. It’ll store metadata for each event on the init invocation and output it to the console on all subsequent invocations. To minimize the console output, it also supports filtering by event type. Here it is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const asyncHooks = require('async_hooks')

module.exports = (types) =&amp;gt; {
  // contains metadata for all tracked events
  const tracked = {}

  const asyncHook = asyncHooks.createHook({
    init: (asyncId, type, triggerAsyncId, resource) =&amp;gt; {
      if (!types || types.includes(type)) {
        const meta = {
          asyncId,
          type,
          pAsyncId: triggerAsyncId,
          res: resource
        }
        tracked[asyncId] = meta
        printMeta('init', meta)
      }
    },
    before: (asyncId) =&amp;gt; {
      const meta = tracked[asyncId]
      if (meta) printMeta('before', meta)
    },
    after: (asyncId) =&amp;gt; {
      const meta = tracked[asyncId]
      if (meta) printMeta('after', meta)
    },
    destroy: (asyncId) =&amp;gt; {
      const meta = tracked[asyncId]
      if (meta) printMeta('destroy', meta)
      // delete metadata for the event
      delete tracked[asyncId]
    },
    promiseResolve: (asyncId) =&amp;gt; {
      const meta = tracked[asyncId]
      if (meta) printMeta('promiseResolve', meta)
    }
  })

  asyncHook.enable()

  function printMeta (eventName, meta) {
    console.log(`[${eventName}] asyncId=${meta.asyncId}, ` +
      `type=${meta.type}, pAsyncId=${meta.pAsyncId}, ` +
      `res type=${meta.res.constructor.name}`)
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’re going to use it as a module in our experiments, so let’s place it in a file called verbose-hook.js. Now, we’re ready for experiments. For the sake of simplicity, we’ll be mostly using Timers API (to be precise, the setTimeout() function) in our examples.&lt;/p&gt;

&lt;p&gt;First, let’s see what happens for a single timer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;require('./verbose-hook')(['Timeout'])

setTimeout(() =&amp;gt; {
  console.log('Timeout happened')
}, 0)
console.log('Registered timeout')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script will produce the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[init] asyncId=5, type=Timeout, pAsyncId=1, res type=Timeout
Registered timeout
[before] asyncId=5, type=Timeout, pAsyncId=1, res type=Timeout
Timeout happened
[after] asyncId=5, type=Timeout, pAsyncId=1, res type=Timeout
[destroy] asyncId=5, type=Timeout, pAsyncId=1, res type=Timeout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As we can see, the lifecycle of a single setTimeout operation is very simple and straightforward. It starts with a (synchronous!) call of the init listener when the async operation (the timeout’s callback) is added to the timers queue, or, in other words, when an async resource is created. Just before the callback is triggered, the before listener fires, followed by the after and destroy listeners once the callback has finished execution.&lt;/p&gt;

&lt;p&gt;You may wonder, what will happen in case of nested operations? Let’s see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;require('./verbose-hook')(['Timeout'])

setTimeout(() =&amp;gt; {
  console.log('Timeout 1 happened')
  setTimeout(() =&amp;gt; {
    console.log('Timeout 2 happened')
  }, 0)
  console.log('Registered timeout 2')
}, 0)
console.log('Registered timeout 1')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script will produce a longer output which looks similar to this one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[init] asyncId=5, type=Timeout, pAsyncId=1, res type=Timeout
Registered timeout 1
[before] asyncId=5, type=Timeout, pAsyncId=1, res type=Timeout
Timeout 1 happened
[init] asyncId=11, type=Timeout, pAsyncId=5, res type=Timeout
Registered timeout 2
[after] asyncId=5, type=Timeout, pAsyncId=1, res type=Timeout
[destroy] asyncId=5, type=Timeout, pAsyncId=1, res type=Timeout
[before] asyncId=11, type=Timeout, pAsyncId=5, res type=Timeout
Timeout 2 happened
[after] asyncId=11, type=Timeout, pAsyncId=5, res type=Timeout
[destroy] asyncId=11, type=Timeout, pAsyncId=5, res type=Timeout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output shows that nested async operations are directly correlated in the Async Hooks API. The id of the root setTimeout operation (asyncId=5) acts as the parent (or trigger) id for the nested operation (asyncId=11). Another interesting thing shown in this output is that the destroy event for the root call happens before the nested destroy. That’s because the destroy listener is called after the resource corresponding to the async operation (the Timeout object in our case) is destroyed.&lt;/p&gt;

&lt;p&gt;Another important thing to notice about the destroy event is that under certain conditions it might not be triggered at all. Here is what the official docs say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Some resources depend on garbage collection for cleanup, so if a reference is made to the resource object passed to init it is possible that destroy will never be called, causing a memory leak in the application. If the resource does not depend on garbage collection, then this will not be an issue.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, if you’re developing a library based on async hooks, you need to think about possible memory leaks that your library may introduce.&lt;/p&gt;

&lt;p&gt;How about doing some bad things now? Let’s try to create a timeout, then clear it right away and see what events will be registered by the async hook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;require('./verbose-hook')(['Timeout'])

clearTimeout(
  setTimeout(() =&amp;gt; {
    console.log('Timeout happened')
  }, 0)
)
console.log('Registered timeout')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example produces the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[init] asyncId=5, type=Timeout, pAsyncId=1, res type=Timeout
Registered timeout
[destroy] asyncId=5, type=Timeout, pAsyncId=1, res type=Timeout
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Despite being immediately cancelled, the timeout still creates an async resource in Async Hooks terminology. Thus, listeners for the init and destroy events are still triggered. This example also shows that the before and after events are not guaranteed to be called.&lt;/p&gt;

&lt;p&gt;So far, we haven’t seen any promiseResolve events. That’s because we weren’t using any native promises in our examples. Let’s start with the most trivial example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;require('./verbose-hook')(['PROMISE'])

Promise.resolve()
console.log('Registered Promise.resolve')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script outputs the following into the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[init] asyncId=5, type=PROMISE, pAsyncId=1, res type=PromiseWrap
[promiseResolve] asyncId=5, type=PROMISE, pAsyncId=1, res type=PromiseWrap
Registered Promise.resolve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Interestingly, in this example the promiseResolve listener runs synchronously during execution of the Promise.resolve() function. As the docs mention, the promiseResolve listener is triggered when the resolve function passed to the Promise constructor is invoked (either directly or through other means of resolving a promise). In our case, the resolve function is called synchronously by Promise.resolve().&lt;/p&gt;

&lt;p&gt;As another consequence, promiseResolve (and the other listeners) will be triggered multiple times when promise chains are built with then()/catch() calls. To illustrate this, let’s look at the following example (this time we’ll use Promise.reject() to make it a bit different from the previous one):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;require('./verbose-hook')(['PROMISE'])

Promise.reject()
  .catch(() =&amp;gt; console.log('Promise.reject callback'))
console.log('Registered Promise.reject')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script produces the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[init] asyncId=5, type=PROMISE, pAsyncId=1, res type=PromiseWrap
[promiseResolve] asyncId=5, type=PROMISE, pAsyncId=1, res type=PromiseWrap
[init] asyncId=8, type=PROMISE, pAsyncId=5, res type=PromiseWrap
Registered Promise.reject
[before] asyncId=8, type=PROMISE, pAsyncId=5, res type=PromiseWrap
Promise.reject callback
[promiseResolve] asyncId=8, type=PROMISE, pAsyncId=5, res type=PromiseWrap
[after] asyncId=8, type=PROMISE, pAsyncId=5, res type=PromiseWrap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As expected, we see a hierarchy of two async resources here. The first one (asyncId=5) corresponds to the Promise.reject() invocation, while the second one (asyncId=8) stands for the chained catch() call.&lt;/p&gt;

&lt;p&gt;By this point, you should have an understanding of the main principles behind the Async Hooks API. Don’t hesitate to do more experiments with other scenarios and &lt;a href="https://nodejs.org/api/async_hooks.html#async_hooks_type"&gt;types&lt;/a&gt; of events.&lt;/p&gt;

&lt;p&gt;Now, we’re going to discuss some internal implementation details.&lt;/p&gt;

&lt;h3&gt;
  
  
  Diving a Bit Deeper
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;An important note.&lt;/strong&gt; I’m using one of the latest commits from node’s master branch in all links below, so the internals may differ in past/future versions. Also, there are no code snippets in this section, so feel free to follow the links if you’re interested in seeing the source code.&lt;/p&gt;

&lt;p&gt;If you want to understand how Async Hooks are implemented by reading the node sources, then the first thing to check is the &lt;a href="https://github.com/nodejs/node/blob/63b06551f4c51d8868f36b184a6539ebc51cf0b4/lib/internal/async_hooks.js"&gt;async_hooks&lt;/a&gt; module itself. It defines the &lt;a href="https://github.com/nodejs/node/blob/e570ae79f5b1d74fed52d2aed7f014caaa0836dd/lib/async_hooks.js#L44"&gt;AsyncHook&lt;/a&gt; class, which describes objects returned by the createHook() function, as well as the so-called &lt;a href="https://nodejs.org/api/async_hooks.html#async_hooks_javascript_embedder_api"&gt;JS Embedder API&lt;/a&gt;. The latter allows you to extend the &lt;a href="https://github.com/nodejs/node/blob/e570ae79f5b1d74fed52d2aed7f014caaa0836dd/lib/async_hooks.js#L141"&gt;AsyncResource&lt;/a&gt; class, so that lifetime events of your own resources will be processed by the Async Hooks API.&lt;/p&gt;

&lt;p&gt;If you continue to dive deeper, you’re going to find the &lt;a href="https://github.com/nodejs/node/blob/e570ae79f5b1d74fed52d2aed7f014caaa0836dd/lib/internal/async_hooks.js"&gt;internal/async_hooks&lt;/a&gt; module. This module is used by the public one and acts as a bridge between native code and JS part of the API. The C++ part of Async Hooks API is represented by the &lt;a href="https://github.com/nodejs/node/blob/e570ae79f5b1d74fed52d2aed7f014caaa0836dd/src/async_wrap.cc"&gt;async_wrap&lt;/a&gt; native module, which also defines the public &lt;a href="https://github.com/nodejs/node/blob/e570ae79f5b1d74fed52d2aed7f014caaa0836dd/src/async_wrap.cc#L678"&gt;C++ Embedder API&lt;/a&gt;. The native API defines AsyncWrap and PromiseWrap classes that we’re going to be considering later. So, these three modules define the main part of Async Hooks API implementation.&lt;/p&gt;

&lt;p&gt;Let’s consider a concrete example of the call chain that happens behind the scenes right before an init listener is triggered.&lt;/p&gt;

&lt;p&gt;On the JS side, the AsyncResource class calls the &lt;a href="https://github.com/nodejs/node/blob/e570ae79f5b1d74fed52d2aed7f014caaa0836dd/lib/async_hooks.js#L163"&gt;emitInit&lt;/a&gt;() function in its constructor. This function is &lt;a href="https://github.com/nodejs/node/blob/e570ae79f5b1d74fed52d2aed7f014caaa0836dd/lib/internal/async_hooks.js#L316"&gt;defined&lt;/a&gt; in the internal/async_hooks module. In turn, it calls the &lt;a href="https://github.com/nodejs/node/blob/e570ae79f5b1d74fed52d2aed7f014caaa0836dd/lib/internal/async_hooks.js#L131"&gt;emitInitNative()&lt;/a&gt; function of the same module. Finally, that function synchronously iterates over the existing active hooks and calls the init listener for each of them. That’s why we saw synchronous invocations of init listeners in our experiments.&lt;/p&gt;

&lt;p&gt;On the C++ side, i.e. for native async resources, the same emitInitNative() function is asynchronously (from the native code’s execution perspective) called in the async resource constructor. I don’t think it’s necessary to go through the whole chain of calls this time, so believe me (or check it yourself) that the call eventually happens in the &lt;a href="https://github.com/nodejs/node/blob/e570ae79f5b1d74fed52d2aed7f014caaa0836dd/src/async_wrap.cc#L631"&gt;EmitAsyncInit()&lt;/a&gt; function.&lt;/p&gt;

&lt;p&gt;As expected, the Async Hooks API (namely, the AsyncWrap native class) is integrated into all standard Node.js modules. For example, you may find AsyncWrap in the internals of native modules, like &lt;a href="https://github.com/nodejs/node/blob/e570ae79f5b1d74fed52d2aed7f014caaa0836dd/lib/internal/crypto/pbkdf2.js#L33"&gt;crypto&lt;/a&gt; and &lt;a href="https://github.com/nodejs/node/blob/e570ae79f5b1d74fed52d2aed7f014caaa0836dd/src/node_file.cc#L101"&gt;fs&lt;/a&gt;, and in many other places. As another example, the promiseResolve event listener is based on a &lt;a href="https://github.com/nodejs/node/blob/e570ae79f5b1d74fed52d2aed7f014caaa0836dd/src/async_wrap.cc#L236"&gt;hook&lt;/a&gt; for native promises.&lt;/p&gt;

&lt;p&gt;In summary, almost any async activity happening in your node app can be listened to via the Async Hooks API. The only exception might be some 3rd-party native modules and wrappers around them. But for those you can still use the Embedder API to integrate them with async hooks.&lt;/p&gt;

&lt;p&gt;Now, as we have a better understanding of async hooks internals and principles, let’s think of some real-world problems that we can solve with this API.&lt;/p&gt;

&lt;h3&gt;
  
  
  But What Are Async Hooks Good For?
&lt;/h3&gt;

&lt;p&gt;The first use case for async hooks is, as we already know, &lt;a href="https://github.com/Jeff-Lewis/cls-hooked"&gt;Continuation-local storage&lt;/a&gt; (CLS), an async variation of the well-known Thread-local storage concept from the multi-threaded world. In short, the implementation in the cls-hooked library is based on tracking associations between a context object and a chain of async calls starting with the execution of a particular &lt;a href="https://github.com/Jeff-Lewis/cls-hooked#namespacebindcallback-context"&gt;function&lt;/a&gt; or the lifecycle of an &lt;a href="https://github.com/Jeff-Lewis/cls-hooked#namespacebindemitteremitter"&gt;EventEmitter&lt;/a&gt; object. The Async Hooks API is used here to listen to async operations and track contexts through the whole execution chain. Without this or a similar built-in API, you would have to deal with lots of monkey-patching of all the async APIs in Node.js.&lt;/p&gt;

&lt;p&gt;The second use case is profiling and monitoring. Async hooks in combination with the &lt;a href="https://nodejs.org/api/perf_hooks.html"&gt;Performance Timing API&lt;/a&gt; may be used to profile your app. These two APIs allow you to collect information about the async operations happening in the app and measure their duration. For web app monitoring, one can build a middleware that gathers request handling statistics in a sampling manner, i.e. for a certain percentage of requests (not all of them), thus minimizing the performance impact on the app. This information can be written to a file or streamed to a standalone service and visualized later in many ways, e.g. as a &lt;a href="http://www.brendangregg.com/flamegraphs.html"&gt;flame graph&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a real-world example of a profiling tool built on top of the Async Hooks API, I can name the Bubbleprof tool, which is a part of Clinic.js. Once run, it traces async resources and produces a human-readable visualization of the collected data. Check out &lt;a href="https://clinicjs.org/blog/introducing-bubbleprof/"&gt;this blog&lt;/a&gt; post to learn more about Bubbleprof.&lt;/p&gt;

&lt;p&gt;Hopefully, this post gave you a better understanding of async hooks. As we have seen, the Async Hooks API is a powerful out-of-the-box feature that allows you to solve real-world problems in a neat way.&lt;/p&gt;

&lt;p&gt;P.S. If you know any other real-world use cases for async hooks, feel free to describe them in the comments below. I’d really like to hear about those.&lt;/p&gt;




</description>
      <category>javascript</category>
      <category>node</category>
    </item>
    <item>
      <title>Request Id Tracing in Node.js Applications</title>
      <dc:creator>Andrei Pechkurov</dc:creator>
      <pubDate>Thu, 06 Dec 2018 11:25:57 +0000</pubDate>
      <link>https://dev.to/puzpuzpuz/request-id-tracing-in-nodejs-applications-4e82</link>
      <guid>https://dev.to/puzpuzpuz/request-id-tracing-in-nodejs-applications-4e82</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Csz0fyYp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AsduEjXj8BCMSMYU1" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Csz0fyYp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/1024/0%2AsduEjXj8BCMSMYU1" alt="" width="800" height="450"&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@nebirdsplus?utm_source=medium&amp;amp;utm_medium=referral"&gt;Philip Brown&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have ever written back-end applications in Node.js, you know that tracing the same HTTP request through log entries is a problem. Usually your logs look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[07/Nov/2018:15:48:11 +0000] User sign-up: starting request validation

[07/Nov/2018:15:48:11 +0000] User sign-up: starting request validation

[07/Nov/2018:15:48:12 +0000] User sign-up: request validation success

[07/Nov/2018:15:48:13 +0000] User sign-up: request validation failed. Reason:
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Here, log entries are mixed up, and there is no way to determine which of them belong to the same request. What you would probably prefer to see is something like this:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[07/Nov/2018:15:48:11 +0000] [request-id:550e8400-e29b-41d4-a716-446655440000] User sign-up: starting request validation

[07/Nov/2018:15:48:11 +0000] [request-id:340b4357-c11d-31d4-b439-329584783999] User sign-up: starting request validation

[07/Nov/2018:15:48:12 +0000] [request-id:550e8400-e29b-41d4-a716-446655440000] User sign-up: request validation success

[07/Nov/2018:15:48:13 +0000] [request-id:340b4357-c11d-31d4-b439-329584783999] User sign-up: request validation failed. Reason:
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Notice the [request-id:*] parts that contain request identifiers. These identifiers allow you to filter log entries that belong to the same request. Moreover, if your application is composed of microservices communicating with each other over HTTP, the request identifier may be sent in an HTTP header and used to trace request chains across the logs generated by all microservices in the chain. It’s difficult to overestimate how valuable this capability is for diagnostics and monitoring.&lt;/p&gt;

&lt;p&gt;As a developer, you would probably want to configure your web framework and/or logger library in a single place and get request ids generated and printed to logs automatically. But unfortunately, this can be a problem in the Node.js world.&lt;/p&gt;

&lt;p&gt;In this article we are going to discuss the problem and one possible solution.&lt;/p&gt;
&lt;h3&gt;
  
  
  OK, but Is It Even a Problem?
&lt;/h3&gt;

&lt;p&gt;In many other languages and platforms, like the JVM and Java Servlet containers, where HTTP servers are built around a multi-threaded architecture and blocking I/O, this is not a problem. If we put the identifier of the thread that processes an HTTP request into the logs, it already serves as a natural filter for tracing a particular request. This solution is far from ideal, but it can be enhanced further using &lt;a href="https://en.wikipedia.org/wiki/Thread-local_storage"&gt;Thread-local storage&lt;/a&gt; (TLS). TLS is basically a way to store and retrieve key-value pairs in a context associated with the current thread. In our case, it may be used to store ids (and any other diagnostic data) generated for each new request. Many logging libraries have features built around TLS. As an example, check out the docs for &lt;a href="https://logback.qos.ch/manual/mdc.html"&gt;SLF4J’s Mapped Diagnostic Context&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Due to the asynchronous nature of Node.js, which is based on the &lt;a href="https://nodejs.org/en/docs/guides/event-loop-timers-and-nexttick/"&gt;event loop&lt;/a&gt;, there is simply nothing like thread-local storage, as your JS code is processed on a single thread. Without a comparable API, you would have to drag a context object containing the request id through all of your request handling calls.&lt;/p&gt;

&lt;p&gt;Let’s see how this might look in a simple Express application:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;p&gt;In this imaginary app, we had to pass the req object into the fakeDbAccess function so that we would be able to output the request id to the logs. Think of how redundant and error-prone this approach becomes in real-world apps, which usually have many more routes and modules.&lt;/p&gt;

&lt;p&gt;Luckily, folks from the Node.js community started thinking about alternatives to TLS &lt;a href="https://github.com/nodejs/node-v0.x-archive/issues/5243"&gt;long ago&lt;/a&gt;. The most popular of those alternatives is called Continuation-local storage (CLS).&lt;/p&gt;

&lt;h3&gt;
  
  
  CLS to the Rescue!
&lt;/h3&gt;

&lt;p&gt;The very first implementation of CLS is the &lt;a href="https://github.com/othiym23/node-continuation-local-storage"&gt;continuation-local-storage&lt;/a&gt; library, which gives the following definition of CLS:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Continuation-local storage works like thread-local storage in threaded programming, but is based on chains of Node-style callbacks instead of threads.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you check the library’s API, you may find it a bit more complicated than, for example, Java’s TLS. But at its core, it’s much the same: it allows you to associate a context object with a chain of calls and obtain it later.&lt;/p&gt;

&lt;p&gt;The initial library was based on the process.addAsyncListener API, which was available until Node.js v0.12, and on its &lt;a href="https://github.com/othiym23/async-listener"&gt;polyfill&lt;/a&gt; for Node v0.12+. The polyfill does a lot of monkey patching aimed at wrapping built-in Node APIs. That’s the main reason why you should not consider using the original library with modern versions of Node.js.&lt;/p&gt;

&lt;p&gt;Fortunately, there is a successor of the CLS library (a fork, to be more precise) called &lt;a href="https://github.com/jeff-lewis/cls-hooked"&gt;cls-hooked&lt;/a&gt;. On Node.js &amp;gt;= 8.2.1 it uses &lt;a href="https://nodejs.org/api/async_hooks.html"&gt;async_hooks&lt;/a&gt;, Node’s built-in API. Even though that API is still experimental, this option is much better than the polyfill-based one. If you want to know more about the Async Hooks API, check out &lt;a href="https://itnext.io/a-pragmatic-overview-of-async-hooks-api-in-node-js-e514b31460e9"&gt;this post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now that we have the right tool, we know how to approach our initial problem: tracing request ids in app logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  How About Out-of-the-box Solutions for My Beloved Express/Koa/another-web-framework?
&lt;/h3&gt;

&lt;p&gt;As you already know, if you want to have request ids in your Node.js app, you may use cls-hooked and integrate it with the web framework you are using. But you would probably prefer a library that does this for you.&lt;/p&gt;

&lt;p&gt;Recently I was searching for such a library and failed to find a good match for the task. Yes, there are several libraries that integrate web frameworks, e.g. Express, with CLS. There are also libraries that provide request id generation middleware. But I did not find a library that combines both CLS and request ids to solve the request id tracing problem.&lt;/p&gt;

&lt;p&gt;So, ladies and gentlemen, please meet &lt;a href="https://github.com/puzpuzpuz/cls-rtracer"&gt;cls-rtracer&lt;/a&gt;, a small library that solves a not-so-small problem. It provides middlewares for Express and Koa that implement CLS-based request id generation, and it allows you to obtain request ids anywhere in your call chain.&lt;/p&gt;

&lt;p&gt;The integration with cls-rtracer basically requires two steps: first, attach the middleware in the appropriate place; second, configure your logging library.&lt;/p&gt;

&lt;p&gt;Here is how it might look for an Express-based app (see the examples in the cls-rtracer repository for the full code). When run, the app generates console output similar to the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2018-12-06T10:49:41.564Z: The app is listening on 3000

2018-12-06T10:49:49.018Z [request-id:f2fe1a9e-f107-4271-9e7a-e163f87cb2a5]: Starting request handling

2018-12-06T10:49:49.020Z [request-id:f2fe1a9e-f107-4271-9e7a-e163f87cb2a5]: Logs from fakeDbAccess

2018-12-06T10:49:53.773Z [request-id:cd3a33a9-32cb-453b-a0f0-e36c65ff411e]: Starting request handling

2018-12-06T10:49:53.774Z [request-id:cd3a33a9-32cb-453b-a0f0-e36c65ff411e]: Logs from fakeDbAccess

2018-12-06T10:49:54.908Z [request-id:8b352536-d714-4838-a372-a8e2cfcb4f53]: Starting request handling

2018-12-06T10:49:54.910Z [request-id:8b352536-d714-4838-a372-a8e2cfcb4f53]: Logs from fakeDbAccess
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice that the integration itself amounts to attaching the Express middleware created by the rTracer.expressMiddleware() call and obtaining the request id with the rTracer.id() call. So, it’s as simple as it can be. You may also find an example for Koa apps &lt;a href="https://github.com/puzpuzpuz/cls-rtracer/tree/master/examples"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Hopefully, cls-rtracer will help you solve the request id tracing problem and make diagnosing your Node.js apps a bit easier.&lt;/p&gt;

&lt;p&gt;P.S. I must note that there is a certain performance impact from using Async Hooks. However, it should not be critical for most applications. Check out &lt;a href="https://medium.com/@marek.kajda/hi-andrey-e3d04ec1a87e"&gt;this comment&lt;/a&gt; and the follow-up (thanks, &lt;a href="https://medium.com/u/a5985fe05072"&gt;Marek Kajda&lt;/a&gt;!).&lt;/p&gt;

&lt;p&gt;P.P.S. Feel free to request support for additional web frameworks and to report any issues you find.&lt;/p&gt;




</description>
      <category>express</category>
      <category>logging</category>
      <category>koa</category>
      <category>requestid</category>
    </item>
  </channel>
</rss>
