<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ajani James Bilby</title>
    <description>The latest articles on DEV Community by Ajani James Bilby (@ajanibilby).</description>
    <link>https://dev.to/ajanibilby</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F176043%2F719f2286-ec1b-4730-8785-121be70b62ef.jpg</url>
      <title>DEV Community: Ajani James Bilby</title>
      <link>https://dev.to/ajanibilby</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ajanibilby"/>
    <language>en</language>
    <item>
      <title>The Upper Limits of WebAssembly Net Performance</title>
      <dc:creator>Ajani James Bilby</dc:creator>
      <pubDate>Thu, 14 Mar 2024 07:12:16 +0000</pubDate>
      <link>https://dev.to/ajanibilby/the-upper-limits-of-webassembly-performance-1j29</link>
      <guid>https://dev.to/ajanibilby/the-upper-limits-of-webassembly-performance-1j29</guid>
<description>&lt;p&gt;Wasmer.io recently released an article announcing their &lt;a href="https://wasmer.io/posts/winterjs-v1"&gt;Winter.js 1.0&lt;/a&gt;; however, the details of their &lt;a href="https://github.com/wasmerio/winterjs/tree/main/benchmark"&gt;benchmarks&lt;/a&gt; show that running Winter.js in wasm results in a 12x slowdown compared to native.&lt;/p&gt;

&lt;p&gt;A performance difference that large just doesn't sit right with me based on my experience with wasm. Yes, it should be slower, I would believe 4x, but 12x!?!?!? What is this, a PC port of a console game?&lt;/p&gt;

&lt;p&gt;Looking at the code bases of most current software with wasm support, if you get down to the underlying implementation it's a lot closer to a software port than a new target assembly.&lt;br&gt;
I would be willing to bet that over the coming years you will see all sorts of 2x, 4x, 10x articles about how software X massively improved its wasm performance once dev hours were put in because the technology became relevant.&lt;/p&gt;

&lt;p&gt;But I want to know the upper limit of wasm: what will those future performance gains be? I don't have a crystal ball, but I do have two smooth rocks in my back-yard and an IDE, so let's hand roll some web assembly and find out.&lt;/p&gt;
&lt;h2&gt;Banging Some Rocks&lt;/h2&gt;

&lt;p&gt;First of all we're going to need to bind some OS functions so we can talk to the OS's network layer. This is done via &lt;a href="https://wasix.org/"&gt;WASIX&lt;/a&gt;, a set of functions very similar to the POSIX standard.&lt;br&gt;
They get imported in a similar way to a DLL: you specify which functions you're trying to load and from where.&lt;/p&gt;

&lt;p&gt;But how do you know where to import these functions?&lt;br&gt;&lt;br&gt;
Idk bro, I just grepped the &lt;a href="https://github.com/wasix-org/wasix-libc"&gt;wasix-libc&lt;/a&gt; till I found something that looked right.&lt;/p&gt;

&lt;p&gt;There is another important bit to note: WebAssembly is designed for the web, and thus to be sent over the network, so the binaries are designed to be very small. There are a lot of features to reduce the number of bytes in a binary, such as &lt;a href="https://en.wikipedia.org/wiki/LEB128"&gt;LEB128&lt;/a&gt; integer encoding. More importantly for us, it means that function signatures are declared separately from function bodies so they can be reused. So you end up with something kind of cursed like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(type (;0;) (func (param i32) (result i32)))
(type (;1;) (func (param i32 i32) (result i32)))
(type (;2;) (func (param i32 i32 i32) (result i32)))
(type (;3;) (func (param i32 i32 i32 i32) (result i32)))
(type (;4;) (func (param i32 i32 i32 i32 i32) (result i32)))
(type (;5;) (func))
(type (;6;) (func (param i32)))
(type (;7;) (func (param i32 i32)))
(import "wasix_32v1" "fd_write" (func (;0;) (type 3)))
(import "wasix_32v1" "fd_close" (func (;1;) (type 0)))
(import "wasix_32v1" "sock_open"      (func (;2;) (type 3)))
(import "wasix_32v1" "sock_bind"      (func (;3;) (type 1)))
(import "wasix_32v1" "sock_listen"    (func (;4;) (type 1)))
(import "wasix_32v1" "sock_accept_v2" (func (;5;) (type 3)))
(import "wasix_32v1" "sock_send"      (func (;6;) (type 4)))
(import "wasix_32v1" "sock_status"    (func (;7;) (type 1)))
(import "wasix_32v1" "proc_exit"      (func (;8;) (type 6)))
(import "wasix_32v1" "sock_shutdown"  (func (;9;) (type 1)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
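As a rough illustration of how LEB128 keeps those bytes down, here is a sketch of the unsigned variant in Python (Python purely for readability; this encoder is illustrative and not taken from any toolchain):

```python
def uleb128(n):
    """Encode a non-negative integer as unsigned LEB128 bytes."""
    out = bytearray()
    while True:
        byte = n % 128   # take the low 7 bits of the value
        n //= 128        # shift the remaining bits down
        if n:
            out.append(byte + 128)  # set the continuation bit: more bytes follow
        else:
            out.append(byte)        # final byte: continuation bit clear
            return bytes(out)

# The classic worked example: 624485 encodes to the three bytes E5 8E 26
encoded = uleb128(624485)
```

Small values fit in a single byte, which is why wasm immediates and indices stay so compact on the wire.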



&lt;p&gt;Now that we have an outline of all of the functions we're going to use, let's quickly map out the lifetime of our program.&lt;br&gt;
We're really just trying to test the performance of the WASIX network stack, so doing anything too funky with multithreading or advanced algorithms would be more a test of current runtimes' multithreading implementations than of the root performance costs that might never be removable from the networking interface.&lt;/p&gt;

&lt;p&gt;So we want a really hot single-threaded loop. That means blocking, but we only want to block at times our CPU couldn't be doing something else anyway.&lt;br&gt;
We also literally don't care what the incoming request says, because we're testing raw TCP throughput.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti3kzhbqpoa4gjrea7ey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti3kzhbqpoa4gjrea7ey.png" alt="Image description" width="458" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our whole program is pretty much setup, followed by a loop of a handful of calls that can go around blazingly fast.&lt;/p&gt;

&lt;p&gt;First of all, let's get our data out of the way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We're always listening on the same port for incoming requests&lt;/li&gt;
&lt;li&gt;We're always replying with the same message&lt;/li&gt;
&lt;li&gt;We're only responding to one request at a time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means all of this memory can be predefined at compile time to be reused.&lt;/p&gt;



&lt;p&gt;So we'll make our struct defining what we're listening for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(data (i32.const 48) "\01\00")                   ;; sin_family: AF_INET = 0x0001
(data (i32.const 50) "\90\1f")                   ;; sin_port:      8080 = 0x1F90
(data (i32.const 52) "\00\00\00\00")             ;; sin_addr:INADDR_ANY = 0.0.0.0
(data (i32.const 56) "\00\00\00\00\00\00\00\00") ;; sin_zero = char[8] padding for sockaddr compatibility
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
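Those four data segments can be sanity-checked by building the same 16-byte structure in Python. Note the fields here are little-endian, matching the wat comments above (classic BSD sockets would instead store the port in network byte order); this is just an illustrative byte-layout check:

```python
family = (1).to_bytes(2, "little")      # sin_family: AF_INET = 0x0001
port = (8080).to_bytes(2, "little")     # sin_port: 8080 = 0x1F90
addr = (0).to_bytes(4, "little")        # sin_addr: INADDR_ANY = 0.0.0.0
zero = bytes(8)                         # sin_zero: char[8] padding
sockaddr = family + port + addr + zero  # 16 bytes total, matching offsets 48..63
```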



&lt;p&gt;Now we'll craft our output response - this is encoded with an &lt;code&gt;iovec&lt;/code&gt;, which is basically just two &lt;code&gt;i32&lt;/code&gt; integers slapped together: the first is a pointer to the start of the buffer, and the second is the length of the buffer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(data (i32.const 80) "\58\00\00\00\24\00\00\00")
(data (i32.const 88) "HTTP/1.1 200 OK\0d\0a\0d\0aHello, World!")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
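Decoding that first segment: the pointer is 88 (0x58, where the response text starts) and the length is 36 (0x24 bytes). A quick Python check of the byte layout:

```python
buf_ptr = (88).to_bytes(4, "little")  # i32 pointer to the response buffer (0x58)
buf_len = (36).to_bytes(4, "little")  # i32 buffer length in bytes (0x24)
iovec = buf_ptr + buf_len             # the 8-byte iovec stored at offset 80
```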



&lt;p&gt;When we get an incoming request we need a place to store its details, so we can tell the OS which request we're responding to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(data (i32.const 160) "\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00")

;; stack: offset.255
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Imports done, global variables done, now we just need the actual code.&lt;/p&gt;




&lt;p&gt;First we need to define our function and the fact it has two local variables; these will be used to store the file descriptor of the socket we have open, and the file descriptor of an incoming request's socket.&lt;br&gt;
These local variables are a lot closer to user-defined registers than actual variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(func (;10;) (type 5) (local i32 i32)
  ;; ...
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we create a new OS socket, specifying it's an IPv4 socket (&lt;code&gt;AF_INET&lt;/code&gt;) using TCP (&lt;code&gt;SOCK_STREAM&lt;/code&gt;). Since the early specification of wasm doesn't allow multiple return values, the return value from this first call is an error code - but we don't care about that.&lt;br&gt;
We give it the pointer &lt;code&gt;255&lt;/code&gt;, a region that won't interfere with our global data; after a successful call the file descriptor will be written there, and we then load it into a local variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;;; Create socket using sock_open
i32.const 1    ;; AF_INET
i32.const 1    ;; SOCK_STREAM
i32.const 0    ;; Protocol
i32.const 255  ;; result pointer
call 2         ;; sock_open()
drop           ;; we don't care about errors

;; Load the socket descriptor from memory
i32.const 255
i32.load
local.set 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next step: bind the socket to the &lt;code&gt;sockaddr_in&lt;/code&gt; we defined earlier in global memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;;; Bind socket to address and port
local.get 0   ;; Socket file descriptor
i32.const 48  ;; Address of the sockaddr_in structure
call 3        ;; sock_bind()
drop          ;; if it's not going to error, hopefully
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tell the OS we're listening for requests, and queue up to 100 pending connections&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;;; Listen for incoming connections
local.get 0     ;; Socket file descriptor
i32.const 100   ;; Backlog (maximum pending connections)
call 4          ;; sock_listen()
drop            ;; it's just wasted cycles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now for the hot loop&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(loop
  local.get 0    ;; Listening socket file descriptor
  i32.const 0    ;; Desired file descriptor flags (default)
  i32.const 64   ;; result pointer: new socket
  i32.const 160  ;; result pointer: remote address
  call 5         ;; sock_accept_v2()
  drop           ;; we only accept winners in these parts

  ;; Load the new socket descriptor from memory
  i32.const 64
  i32.load
  local.set 1

  ;; Send response to the client
  local.get 1    ;; socket
  i32.const 80   ;; iovs
  i32.const 1    ;; iovs_len
  i32.const 0    ;; No additional flags
  i32.const 160  ;; result pointer: bytes sent
  call 6         ;; sock_send()
  drop           ;; get dropped

  ;; Shutdown the socket
  local.get 1 ;; socket
  i32.const 2 ;; how: SHUT_RDWR
  call 9      ;; sock_shutdown()
  drop        ;; we're done here

  ;; Close the fd
  local.get 1 ;; socket
  call 1      ;; fd_close()
  drop        ;; bye

  br 0
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Testing Methodology&lt;/h2&gt;

&lt;p&gt;For testing we want something directly comparable with Winter.js's benchmark, so we used &lt;a href="https://github.com/wg/wrk"&gt;wrk&lt;/a&gt;, which is made for linux systems.&lt;br&gt;&lt;br&gt;
So to a linux system we shall go.&lt;/p&gt;

&lt;p&gt;Dual booting with modern Windows + secure boot + TPMs makes life painful, so my system doesn't run native linux.&lt;br&gt;&lt;br&gt;
A VPS could have noisy neighbours which would skew results, so we can't use one of those.&lt;br&gt;&lt;br&gt;
I had issues compiling some of this stuff for ARM, so the Raspberry Pi 4 is out the window.&lt;br&gt;&lt;br&gt;
So I used WSL, which definitely hurt performance - but it hurts everyone's performance equally, so that's okay.&lt;/p&gt;

&lt;p&gt;I ran the &lt;a href="https://www.ajanibilby.com/blog/the-upper-limit-of-wasm-performance/appendix#benchmark-source"&gt;&lt;code&gt;server.wat&lt;/code&gt;&lt;/a&gt; through &lt;code&gt;wasmer&lt;/code&gt; directly, as well as using &lt;code&gt;wasmer create-exe --llvm&lt;/code&gt; to get the highest WebAssembly performance possible.&lt;br&gt;
However, giving Winter.js's wasm port the same treatment caused a compilation error I'd need a full-time job to debug.&lt;/p&gt;

&lt;p&gt;I rewrote the &lt;code&gt;server.wat&lt;/code&gt; in C to make &lt;a href="https://www.ajanibilby.com/blog/the-upper-limit-of-wasm-performance/appendix#benchmark-source"&gt;&lt;code&gt;server.c&lt;/code&gt;&lt;/a&gt; as an apples to apples native comparison.&lt;/p&gt;

&lt;p&gt;I also ran Winter.js's NodeJS and Bun benchmarks to have a shared point of reference.&lt;/p&gt;

&lt;p&gt;For each test I ran it three times, taking the median &lt;code&gt;req/sec avg&lt;/code&gt; value for the graph below.&lt;/p&gt;
&lt;h2&gt;Results&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wrk &lt;span class="nt"&gt;-t12&lt;/span&gt; &lt;span class="nt"&gt;-c400&lt;/span&gt; &lt;span class="nt"&gt;-d10s&lt;/span&gt; http://127.0.0.1:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iX3fcfII--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.ajanibilby.com/blog/the-upper-limit-of-wasm-performance/chart.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iX3fcfII--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.ajanibilby.com/blog/the-upper-limit-of-wasm-performance/chart.jpg" alt="Chart" width="758" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;*I was unable to get Winter.js to compile, so the value on this graph is an estimate based on its relative performance to &lt;code&gt;Bun&lt;/code&gt;, &lt;code&gt;Node&lt;/code&gt; and &lt;code&gt;Winter.js (WASIX)&lt;/code&gt;. For exact details you can see the spreadsheet &lt;a href="//./chart.xlsx"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Initially looking at these results, the Bun and Winter.js numbers seem super sus if we assume they were single threaded, since the underlying Javascript should be executing on a single thread (this is also why I didn't test Go).&lt;/p&gt;

&lt;p&gt;If we think about our hot loop flow path, at what times are we waiting when we could be executing the response (limited to a single thread)?&lt;br&gt;&lt;br&gt;
The only time the accept blocks is when there are no pending requests, because we're waiting for the next request.&lt;br&gt;&lt;br&gt;
And when there is a request waiting, the function should return instantly.&lt;/p&gt;

&lt;p&gt;When we send data down the socket, there is no waiting there either, because the OS doesn't wait for confirmation that a packet was received before sending the next one.&lt;br&gt;&lt;br&gt;
Shutting down the socket and closing the file descriptor triggers OS-level cleanup of those resources, which also shouldn't cause a wait.&lt;br&gt;&lt;br&gt;
Assuming all of these are handled well by the OS, there shouldn't be much waiting - we're only sending and receiving &lt;strong&gt;tens&lt;/strong&gt; of bytes.&lt;/p&gt;
&lt;h2&gt;So Let's Look at the Syscalls&lt;/h2&gt;

&lt;p&gt;So I ran the &lt;code&gt;server.wat&lt;/code&gt; again, this time with tracing, and then manually removed the first couple of lines so the log starts when it listens for its second-ever request,&lt;br&gt;
since the first request has a long blocking period because I hadn't started the &lt;code&gt;wrk&lt;/code&gt; command yet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;RUST_LOG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;wasmer_wasix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;trace wasmer run server.wat &lt;span class="nt"&gt;--net&lt;/span&gt; &amp;amp;&amp;gt; log.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sock_accept_v2: wasmer_wasix::syscalls::wasix::sock_accept: return=Ok(Errno::success) sock=6 fd=9614
sock_accept_v2: wasmer_wasix::syscalls::wasix::sock_accept: close time.busy=137µs time.idle=1.01µs sock=6 fd=9614
sock_send: wasmer_wasix::syscalls::wasix::sock_send: bytes_written=36 fd=9614
sock_send: wasmer_wasix::syscalls::wasix::sock_send: return=Ok(Errno::success) fd=9614 nsent=36
sock_send: wasmer_wasix::syscalls::wasix::sock_send: close time.busy=224µs time.idle=842ns fd=9614 nsent=36
sock_shutdown: wasmer_wasix::syscalls::wasix::sock_shutdown: return=Ok(Errno::notconn) sock=9614
sock_shutdown: wasmer_wasix::syscalls::wasix::sock_shutdown: close time.busy=91.6µs time.idle=781ns sock=9614
fd_close: wasmer_wasix::fs: closing file descriptor fd=9614 inode=9615 ref_cnt=1 pid=1 fd=9614
fd_close: wasmer_wasix::syscalls::wasi::fd_close: return=Ok(Errno::success) pid=1 fd=9614
fd_close: wasmer_wasix::syscalls::wasi::fd_close: close time.busy=191µs time.idle=852ns pid=1 fd=9614
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we'll do a little bit of python to get the aggregate values. Since our hot loop is really tight, it doesn't matter whether we sum or average the &lt;code&gt;time&lt;/code&gt;s per call, because each call is made exactly once per iteration.&lt;/p&gt;
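A sketch of that aggregation step (illustrative only; the regex assumes the trace-line shape shown above, keying on each syscall's closing time.busy= field and normalising ns to µs):

```python
import re
from collections import defaultdict

# Matches lines like:
#   sock_send: wasmer_wasix::syscalls::wasix::sock_send: close time.busy=224µs time.idle=842ns fd=9614
LINE = re.compile(r"^(\w+): .*close time\.busy=([\d.]+)(µs|ns)")

def aggregate(log):
    busy = defaultdict(float)  # syscall name -> total busy time in microseconds
    for line in log.splitlines():
        m = LINE.match(line)
        if m:
            scale = 1.0 if m.group(3) == "µs" else 0.001  # normalise ns to µs
            busy[m.group(1)] += float(m.group(2)) * scale
    return dict(busy)
```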

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bISiNGhz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.ajanibilby.com/blog/the-upper-limit-of-wasm-performance/sys-call-graph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bISiNGhz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.ajanibilby.com/blog/the-upper-limit-of-wasm-performance/sys-call-graph.png" alt="syscall graph" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this it's obvious that our assumption was partly correct: &lt;code&gt;sock_shutdown&lt;/code&gt; does basically nothing, same with &lt;code&gt;sock_accept_v2&lt;/code&gt; since we have constant incoming requests. But there are two big problems: &lt;code&gt;fd_close&lt;/code&gt; and &lt;code&gt;sock_send&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fd_close&lt;/code&gt; presumably runs all of the necessary OS cleanup on the file descriptor then and there before switching context back to our app, and this is likely also true for &lt;code&gt;sock_send&lt;/code&gt;, since in comparison to most system calls they're very cheap.&lt;br&gt;&lt;br&gt;
The problem is that since we're only making cheap calls, to us they're quite expensive - and this is where Winter.js and Bun can pull ahead.&lt;/p&gt;

&lt;p&gt;Depending on what mechanism you use to communicate between threads in a program, it can be cheaper than a system call. Hence if, instead of doing the expensive &lt;code&gt;sock_send&lt;/code&gt;, &lt;code&gt;sock_shutdown&lt;/code&gt; and &lt;code&gt;fd_close&lt;/code&gt; on our main thread, we just throw them over to a secondary worker thread to do our dirty laundry, we could actually see measurable performance increases. This is likely the main reason why Winter.js and Bun can pull ahead - they're probably both doing this.&lt;/p&gt;

&lt;p&gt;This is also likely the reason why Winter.js in wasm is super slow: the multithreading model in WebAssembly might not be highly optimised, so the communication between threads could end up being more costly than just running the system call.&lt;br&gt;
This would give us exactly the results we saw in our first graph.&lt;/p&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;Just like I said in the beginning, there is a big chance that current WebAssembly performance will increase at the programming-language level; I think there is still room for improvement based on these graphs.&lt;br&gt;
WebAssembly didn't start with a multithreading specification; it was added later and is still a flag you have to enable on some runtimes, so it makes sense that it might not be well optimised yet.&lt;br&gt;
This is likely compounded by the fact that probably no programming language is using the existing multithreading systems properly yet, so the optimisation focus is on the languages rather than the runtimes.&lt;/p&gt;

&lt;p&gt;I don't think WebAssembly will ever reach the performance of native, but that's not the point; all it needs to be is on par with the performance of current virtualisation platforms.&lt;br&gt;
Based on the fact that we can already touch Node performance, the currently available runtimes are suitable for a lot of current server workloads - the question is whether it can get to the point where it's applicable to all server workloads,&lt;br&gt;
where you can just push your complete server bundled as a single wasm binary, specify a few environment variables, and let the data centres handle it from there.&lt;/p&gt;




&lt;p&gt;Source code for benchmarks and raw results can be found in the &lt;a href="https://www.ajanibilby.com/blog/the-upper-limit-of-wasm-performance/appendix"&gt;Appendix&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webassembly</category>
      <category>performance</category>
      <category>http</category>
    </item>
    <item>
      <title>Wasm is not going to save Javascript</title>
      <dc:creator>Ajani James Bilby</dc:creator>
      <pubDate>Thu, 20 Jul 2023 05:34:01 +0000</pubDate>
      <link>https://dev.to/ajanibilby/wasm-is-not-going-to-save-javascript-2gp0</link>
      <guid>https://dev.to/ajanibilby/wasm-is-not-going-to-save-javascript-2gp0</guid>
<description>&lt;p&gt;This article is a case study of the performance impact of improving the &lt;a href="https://bnf-parser.ajanibilby.com/"&gt;bnf-parser&lt;/a&gt; library to be able to take a given &lt;a href="https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form"&gt;BNF syntax&lt;/a&gt; input and compile it all the way down to an optimised parser in wasm for execution - improving parse times of arbitrary syntaxes.&lt;/p&gt;

&lt;h2&gt;Testing Methodology&lt;/h2&gt;

&lt;p&gt;For each round of testing we will parse two rather complex BNFs (&lt;a href="https://github.com/AjaniBilby/BNF-parser/blob/main/test/bnf/sequalize.bnf"&gt;sequalize&lt;/a&gt;, &lt;a href="https://github.com/AjaniBilby/BNF-parser/blob/main/test/bnf/lolcode.bnf"&gt;lolcode&lt;/a&gt;) via three different parsing methods sequentially, measuring the total parse time for each method using perf hooks. The library itself is actually bootstrapped (it compiles and uses itself), and the first stage of compiling a BNF syntax is parsing it - so this is a valid test case for potential parsers generated by the library.&lt;/p&gt;

&lt;p&gt;All three parse methods are tested per round to keep the execution of each method closely coupled to each other to mitigate the impacts of background processes, and V8 optimisations so that these external factors will hopefully affect each of the parsers similarly.&lt;/p&gt;

&lt;p&gt;The first two parsers are actually the same wasm-compiled parser in two different modes: the first with source mapping enabled, and the second without. Source mapping is an optional extra pass which can be applied to the wasm parser, correctly mapping syntax nodes of the tree to the &lt;code&gt;column&lt;/code&gt;, &lt;code&gt;row&lt;/code&gt;, and &lt;strong&gt;javascript&lt;/strong&gt; &lt;code&gt;index&lt;/code&gt; (don't get me started on UTF-16) they span. This is optional in &lt;a href="https://bnf-parser.ajanibilby.com/"&gt;bnf-parser&lt;/a&gt; because it allows the parser to not waste time allocating extra reference objects which aren't necessary for applications that don't need syntax error messages.&lt;/p&gt;

&lt;p&gt;These compiled parsers have also had the default optimisations applied within &lt;a href="https://www.npmjs.com/package/binaryen"&gt;binaryen&lt;/a&gt; which should hopefully give them an advantage over the Javascript implementation (assuming V8 optimisations don't have their way).&lt;/p&gt;

&lt;p&gt;The third parser is &lt;a href="https://bnf-parser.ajanibilby.com/"&gt;bnf-parser&lt;/a&gt;'s legacy parser, which behaves kind of like a graph traversal completely in Javascript: the graph structure is generated from a BNF, and the resulting syntax tree for a given input is generated based on this graph traversal (like a &lt;a href="https://en.wikipedia.org/wiki/Deterministic_finite_automaton"&gt;DFA&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;We ran the testing round &lt;code&gt;10,000&lt;/code&gt; consecutive times, gathering results using NodeJS &lt;a href="https://nodejs.org/api/perf_hooks.html"&gt;perf_hooks&lt;/a&gt;. We then also ran the tests a second time with more in-depth hooks put into the &lt;a href="https://bnf-parser.ajanibilby.com/artifact/"&gt;bnf-parser artifacts&lt;/a&gt; to see exactly what's going on under the hood. These in-depth measurements were not taken in the tests comparing the different parsers, as the act of measuring them would heavily impact the overall performance of the wasm results - they're just that fast.&lt;/p&gt;

&lt;h2&gt;Results&lt;/h2&gt;

&lt;p&gt;From the results below we can see a few interesting trends: both the &lt;code&gt;wasm w/ mapping&lt;/code&gt; parser and the &lt;code&gt;legacy&lt;/code&gt; parser have significantly higher 99% execution times than their 1%. This is because both of them are receiving a lot of love from V8's excellent optimiser. For the first couple of runs it's slow, but once the JIT realises it's doing the same thing many times it starts to optimise.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We also did a test where after each testing round we attempted to parse another random non-BNF syntax, to see if it threw off V8's optimisations due to the graph traversal functions now running on a different graph to the one they were optimised for. However, that had no noticeable effect.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Wasm &lt;code&gt;w/ map&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;Wasm &lt;code&gt;no map&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;Legacy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Max&lt;/td&gt;
&lt;td&gt;6.1372ms&lt;/td&gt;
&lt;td&gt;2.5623ms&lt;/td&gt;
&lt;td&gt;13.5988ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99%&lt;/td&gt;
&lt;td&gt;2.3481ms&lt;/td&gt;
&lt;td&gt;0.8897ms&lt;/td&gt;
&lt;td&gt;2.2354ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;50%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.5971ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.6350ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.6305ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;td&gt;0.6533ms&lt;/td&gt;
&lt;td&gt;0.2202ms&lt;/td&gt;
&lt;td&gt;0.4673ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min&lt;/td&gt;
&lt;td&gt;0.6437ms&lt;/td&gt;
&lt;td&gt;0.2173ms&lt;/td&gt;
&lt;td&gt;0.4602ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean&lt;/td&gt;
&lt;td&gt;1.2774ms&lt;/td&gt;
&lt;td&gt;0.5384ms&lt;/td&gt;
&lt;td&gt;1.2102ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Comparing the median &lt;code&gt;legacy&lt;/code&gt; times to &lt;code&gt;wasm no map&lt;/code&gt; we can see an approximate &lt;code&gt;2.56x&lt;/code&gt; improvement - however only &lt;code&gt;legacy&lt;/code&gt; is generating &lt;code&gt;SyntaxNode&lt;/code&gt; references, so it's not a fair comparison of equivalent compute. Comparing the &lt;code&gt;wasm&lt;/code&gt; parser with source mapping to &lt;code&gt;legacy&lt;/code&gt; we see only a &lt;code&gt;1.04x&lt;/code&gt; improvement.&lt;/p&gt;

&lt;p&gt;And that might get you thinking - wow, JS really isn't that slow. But comparing &lt;code&gt;wasm&lt;/code&gt; to raw JS performance isn't fair either, because you're missing a step. You need to move data in and out of the JS world to the &lt;code&gt;wasm&lt;/code&gt; instance, and that has a tax.&lt;/p&gt;

&lt;h3&gt;The transport tax&lt;/h3&gt;

&lt;p&gt;There are four main stages of using the &lt;code&gt;wasm&lt;/code&gt; &lt;a href="https://bnf-parser.ajanibilby.com/"&gt;bnf-parser&lt;/a&gt;;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encoding&lt;/strong&gt;: This is where you take the input data from Javascript, and write it into the &lt;code&gt;wasm&lt;/code&gt; instance's memory, and also tell the instance how long the data you just put in is&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parsing&lt;/strong&gt;: This is the actual work we want to complete, this is iterating over the string and generating the entire syntax tree&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decoding&lt;/strong&gt;: We want to be able to use that tree in JS, so we need to load it back out to be useful - bring it back over to JS land.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mapping&lt;/strong&gt;: This is generating the source references for a given syntax tree, based on the input data.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;It's important to note that the &lt;code&gt;mapping&lt;/code&gt; part is almost entirely done in Javascript rather than in &lt;code&gt;wasm&lt;/code&gt;, because the computation is super simple: you're just iterating forward over a string, counting the index, as you depth-first traverse the syntax tree filling in the references (&lt;em&gt;this is done using stack operations, so it's a single function call to save on the extra tax of recursive calls in JS&lt;/em&gt;).&lt;br&gt;&lt;br&gt;
Since the majority of the complex work is simply allocating new objects to store the reference at each point, there would be no real time saved by doing this in wasm, and any time saved would be mostly lost to the data-transfer tax.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Encode&lt;/th&gt;
&lt;th&gt;Parse&lt;/th&gt;
&lt;th&gt;Decode&lt;/th&gt;
&lt;th&gt;Mapping&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Max&lt;/td&gt;
&lt;td&gt;0.2647ms&lt;/td&gt;
&lt;td&gt;0.2692ms&lt;/td&gt;
&lt;td&gt;2.6253ms&lt;/td&gt;
&lt;td&gt;3.5991ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99%&lt;/td&gt;
&lt;td&gt;0.0064ms&lt;/td&gt;
&lt;td&gt;0.1443ms&lt;/td&gt;
&lt;td&gt;1.0704ms&lt;/td&gt;
&lt;td&gt;1.1919ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;50%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.0026ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.1335ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.6914ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9063ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;td&gt;0.0023ms&lt;/td&gt;
&lt;td&gt;0.0436ms&lt;/td&gt;
&lt;td&gt;0.1738ms&lt;/td&gt;
&lt;td&gt;0.4232ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min&lt;/td&gt;
&lt;td&gt;0.0020ms&lt;/td&gt;
&lt;td&gt;0.0428ms&lt;/td&gt;
&lt;td&gt;0.1720ms&lt;/td&gt;
&lt;td&gt;0.4160ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean&lt;/td&gt;
&lt;td&gt;0.0032ms&lt;/td&gt;
&lt;td&gt;0.0910ms&lt;/td&gt;
&lt;td&gt;0.5242ms&lt;/td&gt;
&lt;td&gt;0.7071ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From this data we can see we are spending &lt;code&gt;40.03%&lt;/code&gt; of our time just moving data between JS and WASM land - that's almost half of the entire computation.&lt;br&gt;
We can also see that another &lt;code&gt;55.41%&lt;/code&gt; is taken up by generating the source references.&lt;br&gt;&lt;br&gt;
That leaves only &lt;code&gt;7.70%&lt;/code&gt; of the total runtime actually computing the syntax tree!&lt;/p&gt;

&lt;h2&gt;
  
  
  What's up with Javascript's GST rates being so high?
&lt;/h2&gt;

&lt;p&gt;The reason this tax for transferring data is so high is painfully illustrated by the difference in compute time between &lt;code&gt;decode&lt;/code&gt; and &lt;code&gt;mapping&lt;/code&gt;. Mapping is much simpler than decoding - which has to traverse a tree generated by a different language, worry about bit alignment, and actually decode the foreign data into something Javascript can use - yet mapping takes longer.&lt;/p&gt;

&lt;p&gt;The reason is object allocation.&lt;br&gt;&lt;br&gt;
Everything in Javascript is an object, even the object within an object is an object - and that's objectively a problem.&lt;/p&gt;

&lt;p&gt;In any statically typed language, if you allocate a &lt;code&gt;struct&lt;/code&gt; which has another &lt;code&gt;struct&lt;/code&gt; as its member, you get that child for free. That isn't the case in Javascript. Every &lt;code&gt;SyntaxNode&lt;/code&gt; has a &lt;code&gt;ReferenceRange&lt;/code&gt; which contains two &lt;code&gt;Reference&lt;/code&gt; objects - so if you want to allocate a &lt;code&gt;SyntaxNode&lt;/code&gt; and fill in all of its children, that's actually &lt;code&gt;4&lt;/code&gt; allocations, not &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The reason decoding is able to be as fast as it is, is object reuse. By default every single &lt;code&gt;SyntaxNode&lt;/code&gt; shares the same &lt;code&gt;ReferenceRange&lt;/code&gt; instance - that range and its two children only need to be allocated once, yet every &lt;code&gt;SyntaxNode&lt;/code&gt; still gets a &lt;code&gt;ReferenceRange&lt;/code&gt;, so you don't need null checks everywhere - and we only pay one allocation per node.&lt;/p&gt;

&lt;p&gt;But when you run the source map over the syntax tree, now for every single &lt;code&gt;SyntaxNode&lt;/code&gt; you have to perform &lt;code&gt;3&lt;/code&gt; allocations: &lt;code&gt;ReferenceRange&lt;/code&gt;, start &lt;code&gt;Reference&lt;/code&gt;, and end &lt;code&gt;Reference&lt;/code&gt;.&lt;/p&gt;
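&lt;p&gt;A minimal sketch of this reuse trick (the class names are assumptions based on the description above, not bnf-parser's exact code):&lt;/p&gt;

```typescript
// Illustrative only: shows why decode pays 1 allocation per node
// while mapping pays 3 extra.
class Reference {
  constructor(public index: number) {}
}
class ReferenceRange {
  constructor(public start: Reference, public end: Reference) {}
}
class SyntaxNode {
  constructor(public type: string, public ref: ReferenceRange) {}
}

// One shared blank range - allocated once, ever
const BLANK_RANGE = new ReferenceRange(new Reference(0), new Reference(0));

// Decode: every node points at the shared range, so no null checks
// are needed anywhere, and we only allocate the node itself
function decodeNode(type: string): SyntaxNode {
  return new SyntaxNode(type, BLANK_RANGE);
}

// Mapping: now every node needs its own range - 3 allocations per node
function mapNode(node: SyntaxNode, start: number, end: number): void {
  node.ref = new ReferenceRange(new Reference(start), new Reference(end));
}
```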

&lt;p&gt;Part of the reason the execution in &lt;code&gt;wasm&lt;/code&gt; is so fast is that it only does one allocation: the entire tree itself.&lt;br&gt;
The tree in &lt;code&gt;wasm&lt;/code&gt; is flat-packed into linear memory. Since the data is read out after every parse, we don't need the previous tree once a new parse starts - so we can just write over it, reusing the same single allocation for zero allocations per parse. In other languages like &lt;code&gt;C++&lt;/code&gt; you could allocate a vector a factor or two larger than your estimated tree size, compute your flat tree, then shrink it afterwards. Two allocations.&lt;/p&gt;
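&lt;p&gt;The flat-packed, overwrite-in-place idea can be sketched in a few lines (an &lt;code&gt;Int32Array&lt;/code&gt; standing in for linear memory; the 3-slot node layout is an assumption for illustration):&lt;/p&gt;

```typescript
// Sketch of a flat-packed syntax tree that reuses one buffer across parses.
// Each node occupies 3 slots: [typeId, childCount, byteLength]
class FlatTree {
  private buffer = new Int32Array(1024); // single allocation, reused every parse
  private top = 0;

  reset(): void {
    this.top = 0; // just write over the previous tree
  }

  push(typeId: number, childCount: number, byteLength: number): number {
    if (this.top + 3 > this.buffer.length) {
      // Grow rarely, like the oversized C++ vector idea
      const next = new Int32Array(this.buffer.length * 2);
      next.set(this.buffer);
      this.buffer = next;
    }
    const at = this.top;
    this.buffer[at] = typeId;
    this.buffer[at + 1] = childCount;
    this.buffer[at + 2] = byteLength;
    this.top += 3;
    return at;
  }

  read(at: number): { typeId: number; childCount: number; byteLength: number } {
    return {
      typeId: this.buffer[at],
      childCount: this.buffer[at + 1],
      byteLength: this.buffer[at + 2],
    };
  }
}
```

&lt;p&gt;Each parse just calls &lt;code&gt;reset()&lt;/code&gt; and writes over the old tree - the buffer is allocated once and reused.&lt;/p&gt;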

&lt;p&gt;In Javascript everything is an object, everything must be &lt;strong&gt;independently&lt;/strong&gt; allocated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can Wasm still work as an accelerator?
&lt;/h2&gt;

&lt;p&gt;Wasm libraries can still work as accelerators for Javascript, in almost an identical way to how everything in Python is actually a C library. You could have a library for matrix multiplication where all of your matrices are permanently stored in WASM, only coming out after computation is complete to be printed, sent over the network, or written to file.&lt;/p&gt;
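&lt;p&gt;That pattern looks something like this: JS only ever holds opaque handles, and the values themselves never cross the boundary until you ask for them. (A plain object and a &lt;code&gt;Map&lt;/code&gt; stand in for the wasm module and its linear memory here; all the names are made up for illustration.)&lt;/p&gt;

```typescript
// Sketch of the "data stays in wasm" pattern: JS holds handles,
// the matrices live on the other side of the boundary.
type Handle = number;

const matrices = new Map<Handle, Float64Array>(); // stand-in for linear memory
let nextHandle: Handle = 1;

const fakeWasm = {
  // Create an n*n matrix filled with a value, return only a handle
  create(n: number, fill: number): Handle {
    const h = nextHandle++;
    matrices.set(h, new Float64Array(n * n).fill(fill));
    return h;
  },
  // The result also stays "inside wasm" - JS never sees the data
  add(a: Handle, b: Handle): Handle {
    const A = matrices.get(a)!;
    const B = matrices.get(b)!;
    const h = this.create(Math.sqrt(A.length), 0);
    const C = matrices.get(h)!;
    for (let i = 0; i < A.length; i++) C[i] = A[i] + B[i];
    return h;
  },
  // Data only crosses the boundary when we actually need it out
  readCell(h: Handle, i: number): number {
    return matrices.get(h)![i];
  },
};
```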

&lt;p&gt;So much like the current Python ecosystem, JS could head towards a world where it's a glue language - the problem is that typical Javascript workloads are &lt;code&gt;90%&lt;/code&gt; glue.&lt;/p&gt;

&lt;p&gt;For the vast majority of Javascript execution it's:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take something from the network&lt;/li&gt;
&lt;li&gt;Perform a small amount of manipulation&lt;/li&gt;
&lt;li&gt;Send it out to the network, or write to the DOM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;JS is primarily a middle-man language. Attempting to use it like Python - abstracting the middle man's duty to another person, creating another middle man - leads to very little performance gain and a whole lot of headache. Try talking to the tech support of any major tech company and you'll see what I mean.&lt;/p&gt;

&lt;p&gt;Wasm also has another headache. It's security focused, meaning every wasm instance has its own independent memory. That means two different wasm libraries can't actually share memory unless they are recompiled together, or you pass data between wasm libraries much like you do from JS to wasm.&lt;/p&gt;

&lt;p&gt;Plus the majority of wasm-compiled modules don't actually play nicely together; they're compiled to rule their own sandpit, and no one else can enter. If you attempted to bring a C++ library and a Rust library into the same WASM module, whose &lt;code&gt;malloc&lt;/code&gt; implementation are we using? There is only one linear memory, and they can't both be operating on the same space.&lt;br&gt;
Whose are we choosing? How do we choose? How does each of the children know who was chosen?&lt;/p&gt;

&lt;h2&gt;
  
  
  Wasm is what people wanted docker to be
&lt;/h2&gt;

&lt;p&gt;Wasm is a really powerful tool, but I think people are misunderstanding where it's heading and what it will be great for.&lt;br&gt;
No, it will not be great for bringing that one Rust library into your TS workflow.&lt;br&gt;
It's better to think of it like a super lightweight and actually portable docker container that can execute anywhere.&lt;/p&gt;

&lt;p&gt;You can bring that container into the browser to act as your front end, or you can have it running as a micro-service, or as the entire backend.&lt;br&gt;
What it's not is a way to use language &lt;code&gt;X&lt;/code&gt;'s library in language &lt;code&gt;Y&lt;/code&gt;.&lt;/p&gt;

</description>
      <category>webassembly</category>
      <category>performance</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Async functions are needlessly killing your Javascript performance</title>
      <dc:creator>Ajani James Bilby</dc:creator>
      <pubDate>Sat, 01 Apr 2023 13:00:00 +0000</pubDate>
      <link>https://dev.to/ajanibilby/async-functions-are-needlessly-killing-your-javascript-performance-20g2</link>
      <guid>https://dev.to/ajanibilby/async-functions-are-needlessly-killing-your-javascript-performance-20g2</guid>
      <description>&lt;p&gt;While numerous articles offer quick tips to enhance your async JavaScript performance using extra Promise API features, this discussion focuses on how program structure tweaks can lead to significant speed improvements. By optimizing your code, you could potentially achieve 1.9x or even 14x speed boosts.&lt;/p&gt;

&lt;p&gt;I believe the untapped performance potential in asynchronous JavaScript features is due to the V8 engine not providing the expected level of optimization for these features, and there are a few key indicators that suggest this possibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;You can skip this section if you want, but here's a brief overview of the context. I've been working with a &lt;a href="https://www.npmjs.com/package/bnf-parser"&gt;bnf-parser&lt;/a&gt; library that currently needs a complete file to be loaded for parsing it into a BNF-specified syntax tree. However, the library could be refactored to use cloneable state generators, which output file characters sequentially and allow for copying at a specific point to resume reading later.&lt;/p&gt;

&lt;p&gt;So I tried implementing it in Javascript to be able to parse large (1GB+) files into partial syntax trees for processing large XML - partly for fun, and partly because I know I'll soon need to implement something similar in a lower-level language, so this could be good practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Case Study
&lt;/h2&gt;

&lt;p&gt;I aimed to create a layer over the readable data stream from disk that allows iteratively calling forward for small text portions with limited backtracking. I implemented a &lt;a href="https://github.com/AjaniBilby/BNF-parser/blob/350e9a00fc4ca06acc98245377fb705f00d286b8/source/lib/cache.ts#L7-L45"&gt;Cursor&lt;/a&gt; that iterates forward, returning the passed-over characters as a string. Cursors can be cloned, and clones can independently move forward. Importantly, cursors &lt;em&gt;may&lt;/em&gt; need to wait for data currently being streamed to become available before returning the next substring. To minimize memory usage, we discard unreachable data - implementing all of this in an async/await pattern to avoid complex callback chains or unnecessary event loop blocking.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Side note: We use pooling for caching, placing each chunk read from the disk into an array and manipulating the array to free cached data. This method reduces resize operations and string manipulation. However, it can cause NodeJS to report false memory usage, as chunks allocated by the OS are sometimes not counted until manipulated within the application domain.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The cursor features an async read call, asynchronously connecting to a &lt;a href="https://github.com/AjaniBilby/BNF-parser/blob/350e9a00fc4ca06acc98245377fb705f00d286b8/source/lib/cache.ts#L52-L271"&gt;StreamCache&lt;/a&gt; to read from the cache. Multiple cursors may attempt to read the latest unavailable information, requiring a &lt;a href="https://en.cppreference.com/w/cpp/thread/condition_variable"&gt;condition variable&lt;/a&gt;-style lock - an async call to a &lt;a href="https://github.com/AjaniBilby/BNF-parser/blob/350e9a00fc4ca06acc98245377fb705f00d286b8/source/lib/promise-queue.ts"&gt;PromiseQueue&lt;/a&gt; is used to manage this.&lt;/p&gt;
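&lt;p&gt;Conceptually the &lt;code&gt;PromiseQueue&lt;/code&gt; behaves like a condition variable built out of promises - roughly along these lines (a simplified sketch, not the library's actual implementation):&lt;/p&gt;

```typescript
// A promise-based condition variable: callers await wait(),
// and notify() wakes everyone currently waiting.
class PromiseQueue {
  private waiting: Array<() => void> = [];

  // Suspend the caller until the next notify()
  wait(): Promise<void> {
    return new Promise((resolve) => this.waiting.push(resolve));
  }

  // Wake every waiter - e.g. when a new chunk lands in the cache
  notify(): void {
    const woken = this.waiting;
    this.waiting = [];
    for (const resolve of woken) resolve();
  }
}
```

&lt;p&gt;A cursor that reaches the end of the cached data simply &lt;code&gt;await&lt;/code&gt;s &lt;code&gt;wait()&lt;/code&gt;, and the stream side calls &lt;code&gt;notify()&lt;/code&gt; whenever a new chunk arrives.&lt;/p&gt;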

&lt;p&gt;Reading a &lt;code&gt;1GB&lt;/code&gt; file in &lt;code&gt;100-byte&lt;/code&gt; chunks leads to at least &lt;code&gt;10,000,000 IOs&lt;/code&gt; through three async call layers. The problem becomes catastrophic since these functions are essentially language-level abstractions of callbacks, lacking optimizations that come with their async nature. However, we can manually implement optimizations to alleviate this issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;So let's go through the base implementation, then a few different variations and optimisations; or you can skip ahead to the results and work your way backwards if you prefer.&lt;/p&gt;

&lt;p&gt;A quick note about the testing methodology: Each test ran 10 times consecutively, starting from a cold state. The first result was consistently slower, while the other nine were nearly identical. This suggests either NodeJS temporarily saves optimized code between runs, or the NAS intelligently caches the file for quicker access. The latter is more likely, as longer durations between cold starts result in slower initial executions.&lt;/p&gt;

&lt;p&gt;The test file used is &lt;a href="https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2"&gt;here&lt;/a&gt; (streamed as a standalone XML file).&lt;/p&gt;

&lt;h3&gt;
  
  
  Full Async
&lt;/h3&gt;

&lt;p&gt;So we have a cursor which we can call next on, which forwards the request to the &lt;a href="https://github.com/AjaniBilby/BNF-parser/blob/350e9a00fc4ca06acc98245377fb705f00d286b8/source/lib/cache.ts#L52-L271"&gt;StreamCache&lt;/a&gt; - which then handles all of the actual read behaviour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Cursor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;highWaterMark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_owner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;highWaterMark&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then have our main file, which just creates a &lt;a href="https://github.com/AjaniBilby/BNF-parser/blob/350e9a00fc4ca06acc98245377fb705f00d286b8/source/lib/cache.ts#L52-L271"&gt;StreamCache&lt;/a&gt;, adds a cursor, and pipes a &lt;code&gt;fs.createReadStream&lt;/code&gt; into it - in a kind of backwards way compared to the normal piping API, but this is due to the way &lt;code&gt;StreamCache&lt;/code&gt; has been implemented to allow for NodeJS and WebJS readable stream API differences.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The cursor is added before piping to ensure the first bytes of data aren't read into the cache and then dropped because no cursor can reach them&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamCache&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;cursorA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipe_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isDone&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;read&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;end&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeEnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Wrapper Optimisation
&lt;/h3&gt;

&lt;p&gt;In the cursor before, we had an async function basically just acting as a wrapper. If you understand the async abstraction, you'd know an async function just returns a promise - so there is no actual need to create this extra async function; instead we can just return the promise created by the child call. (This has a level of performance benefit it really shouldn't :D)&lt;/p&gt;

&lt;p&gt;To:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Cursor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;highWaterMark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_owner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;highWaterMark&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
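&lt;p&gt;You can see the difference with a toy micro-benchmark (a sketch only - the actual numbers will vary wildly by machine and engine version):&lt;/p&gt;

```typescript
// The wrapped version allocates an extra async frame + promise per call;
// the forwarding version just hands back the child's promise.
async function child(): Promise<number> {
  return 1;
}

async function wrapped(): Promise<number> {
  return await child(); // extra async machinery per call
}

function forwarded(): Promise<number> {
  return child(); // just forward the child's promise
}

async function bench(fn: () => Promise<number>, n: number): Promise<number> {
  const start = Date.now();
  for (let i = 0; i < n; i++) await fn();
  return Date.now() - start;
}

bench(wrapped, 100_000)
  .then((ms) => {
    console.log("wrapped:", ms, "ms");
    return bench(forwarded, 100_000);
  })
  .then((ms) => console.log("forwarded:", ms, "ms"));
```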



&lt;h3&gt;
  
  
  Inlined
&lt;/h3&gt;

&lt;p&gt;In this case we pretended to be a compiler and inlined our own function - we literally just embedded the functionality of &lt;code&gt;StreamCache._read&lt;/code&gt; into where it was being called, which completely broke our public/private attribute protections 😶🔫&lt;br&gt;&lt;br&gt;
If only there was a compiler like &lt;em&gt;Typescript&lt;/em&gt; to do inlining safely for us 👀&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamCache&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;cursorA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipe_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isDone&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Cursor behind buffer position&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_total_cache&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_ended&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;loc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_offset_to_cacheLoc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;read&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;end&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeEnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Async With Peeking
&lt;/h3&gt;

&lt;p&gt;If all else fails, avoid async when possible. So in this case I added a few functions.&lt;br&gt;&lt;br&gt;
Peek will tell me if I can read without waiting, in which case &lt;code&gt;_skip_read&lt;/code&gt; is safe to call.&lt;br&gt;&lt;br&gt;
Otherwise we go back to calling the async method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamCache&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;cursorA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipe_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isDone&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_skip_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isDone&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;read&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;peaked&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;val&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;read&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;end&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeEnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this use case this actually saves a lot of time, because a large proportion of the calls never needed to wait at all - the loaded chunk sizes were so large that the requested data was almost always already in the cache.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Bytes Read&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Via Async&lt;/td&gt;
&lt;td&gt;919417&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Via Peeking&lt;/td&gt;
&lt;td&gt;1173681200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;1174600617&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
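&lt;p&gt;The peek-then-await pattern above can be sketched in isolation. This is a hypothetical stand-in, not the real &lt;code&gt;StreamCache&lt;/code&gt; - the names (&lt;code&gt;PeekableSource&lt;/code&gt;, &lt;code&gt;skipRead&lt;/code&gt;, &lt;code&gt;drain&lt;/code&gt;) are made up for illustration:&lt;/p&gt;

```javascript
// Hypothetical sketch (not the article's StreamCache) of peek-then-await:
// a synchronous read returns whatever is already buffered, and we only pay
// for an await as the fallback when the buffer came back empty.
class PeekableSource {
  constructor() { this.buffer = "abcdef"; }

  // Synchronous peek-read: consume up to `size` chars already in the buffer
  skipRead(size) {
    const out = this.buffer.slice(0, size);
    this.buffer = this.buffer.slice(out.length);
    return out;
  }

  // Async fallback, used only when skipRead() returned nothing
  async next(size) {
    this.buffer += "more data"; // stand-in for waiting on the stream
    return this.skipRead(size);
  }
}

async function drain(src, iterations) {
  let read = 0, peeked = 0;
  for (let i = 0; i < iterations; i++) {
    const val = src.skipRead(4); // no await on the hot path
    peeked += val.length;
    if (val === "") read += (await src.next(4)).length;
  }
  return { read, peeked };
}
```

&lt;p&gt;Because the synchronous path never allocates a &lt;code&gt;Promise&lt;/code&gt;, the await overhead is only paid on the small fraction of calls that actually have to wait.&lt;/p&gt;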

&lt;h3&gt;
  
  
  Disk Read
&lt;/h3&gt;

&lt;p&gt;As with all good tests, we need a baseline - so in this case we don't even have an active cursor; we literally just let data flow in and out of the &lt;code&gt;StreamCache&lt;/code&gt; as fast as possible, giving us the limit of our disk read plus the &lt;code&gt;alloc&lt;/code&gt; and &lt;code&gt;free&lt;/code&gt; overhead as we add and remove cache pools.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamCache&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;cursorA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipe_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;end&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeEnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Callback
&lt;/h3&gt;

&lt;p&gt;Finally we need a test to make sure this isn't a de-optimisation bug: if we go back to the callback-hell days, how do we fare?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: I didn't rewrite the &lt;code&gt;signal.wait()&lt;/code&gt;, as trying to create an optimised callback system inside a for loop would be hell on earth to implement.&lt;br&gt;
And yes, we do need a while loop, because it might take more than one chunk loading in to fulfil the requested read - chunk sizes can sometimes be weird and inconsistent, plus maybe you just want a large chunk read at once 🤷&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StreamCache&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Cursor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Cursor behind buffer position&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Wait for more data to load if necessary&lt;/span&gt;
    &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_total_cache&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// The required data will never be loaded&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_ended&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="c1"&gt;// Wait for more data&lt;/span&gt;
      &lt;span class="c1"&gt;//   Warn: state might change here (including cursor)&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Return the data&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;loc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_offset_to_cacheLoc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;ittr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;read&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isDone&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ittr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;ittr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;end&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeEnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Case&lt;/th&gt;
&lt;th&gt;Duration (Min)&lt;/th&gt;
&lt;th&gt;Median&lt;/th&gt;
&lt;th&gt;Mean&lt;/th&gt;
&lt;th&gt;Max&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;#Full Async&lt;/td&gt;
&lt;td&gt;27.742s&lt;/td&gt;
&lt;td&gt;28.339s&lt;/td&gt;
&lt;td&gt;28.946s&lt;/td&gt;
&lt;td&gt;35.203s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#Async Wrapper Opt&lt;/td&gt;
&lt;td&gt;14.758s&lt;/td&gt;
&lt;td&gt;14.977s&lt;/td&gt;
&lt;td&gt;15.761s&lt;/td&gt;
&lt;td&gt;22.847s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#Callback&lt;/td&gt;
&lt;td&gt;13.753s&lt;/td&gt;
&lt;td&gt;13.902s&lt;/td&gt;
&lt;td&gt;14.683s&lt;/td&gt;
&lt;td&gt;21.909s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#Inlined Async&lt;/td&gt;
&lt;td&gt;2.025s&lt;/td&gt;
&lt;td&gt;2.048s&lt;/td&gt;
&lt;td&gt;3.037s&lt;/td&gt;
&lt;td&gt;11.847s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#Async w/ Peeking&lt;/td&gt;
&lt;td&gt;1.970s&lt;/td&gt;
&lt;td&gt;2.085s&lt;/td&gt;
&lt;td&gt;3.054s&lt;/td&gt;
&lt;td&gt;11.890s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#Disk Read&lt;/td&gt;
&lt;td&gt;1.970s&lt;/td&gt;
&lt;td&gt;1.996s&lt;/td&gt;
&lt;td&gt;2.982s&lt;/td&gt;
&lt;td&gt;11.850s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It's kind of terrifying how effective changing just the wrapper function &lt;code&gt;Cursor.next&lt;/code&gt; is: it shows there are easy optimisation improvements available. That, plus the &lt;code&gt;13.9x&lt;/code&gt; performance improvement from inlining, shows there is room such that even if V8 doesn't get around to implementing these optimisations, tools like Typescript certainly could.&lt;/p&gt;
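&lt;p&gt;The kind of wrapper removal being measured can be sketched as follows. This is a hypothetical stand-in, not the article's implementation - the class bodies and names here (&lt;code&gt;Cache&lt;/code&gt;, &lt;code&gt;viaWrapper&lt;/code&gt;, &lt;code&gt;viaInline&lt;/code&gt;) are made up for illustration:&lt;/p&gt;

```javascript
// Hypothetical sketch: a thin async wrapper (Cursor.next delegating to the
// underlying read) adds a whole extra async frame per call, which a caller
// can "inline" away by calling the underlying function directly.
class Cache {
  constructor(data) { this.data = data; this.offset = 0; }
  async read(size) {
    const out = this.data.slice(this.offset, this.offset + size);
    this.offset += out.length;
    return out;
  }
}

class Cursor {
  constructor(cache) { this.cache = cache; }
  // The thin wrapper: one extra async function per read
  async next(size) { return await this.cache.read(size); }
}

async function viaWrapper(data) {
  const cursor = new Cursor(new Cache(data));
  let read = 0, chunk;
  while ((chunk = await cursor.next(2)).length > 0) read += chunk.length;
  return read;
}

async function viaInline(data) {
  const cache = new Cache(data);
  let read = 0, chunk;
  // Manually inlined: identical behaviour, one less async frame per iteration
  while ((chunk = await cache.read(2)).length > 0) read += chunk.length;
  return read;
}
```

&lt;p&gt;Both functions produce identical results; the only difference is how many async frames each iteration passes through, which is exactly the overhead the benchmark exposes.&lt;/p&gt;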

&lt;p&gt;Also, if you look at the peeking example, we hit quite an interesting limit. In that case only &lt;code&gt;0.078%&lt;/code&gt; of requests were fulfilled by the async function, meaning only about &lt;code&gt;9194&lt;/code&gt; of &lt;code&gt;11746006&lt;/code&gt; requests had to wait for data to load. This implies our CPU is being almost perfectly fed by the incoming data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The performance of asynchronous JavaScript functions can be significantly improved with simple tweaks to the code. The results of this case study demonstrate the potential for 1.9x to 14x speed boosts with manual optimisations. V8's current lack of optimisation for these features leaves room for further improvements in the future.&lt;/p&gt;

&lt;p&gt;When using the raw &lt;code&gt;Promise&lt;/code&gt; API directly, there is a strong argument that optimising this behaviour without potentially altering execution order is extraordinarily hard. But when we use the &lt;code&gt;async&lt;/code&gt;/&lt;code&gt;await&lt;/code&gt; syntax without ever touching a &lt;code&gt;Promise&lt;/code&gt; directly, our functions are written in such a way that some fairly easy, behaviour-preserving optimisations become possible.&lt;/p&gt;

&lt;p&gt;The fact that simply altering the wrapper call creates an almost 1.9x boost in performance should be horrifying for anyone who has used a compiled language. It's a simple function call redirection, and could easily be optimised out of existence in most cases.&lt;/p&gt;

&lt;p&gt;We don't need to wait for the browsers to implement these optimisations; tools such as Typescript already offer transpiling to older ES versions, clearly showing the compiler infrastructure has a deep understanding of the behaviour of the language. For a long time people have been saying that Typescript doesn't need to optimise your Javascript since V8 already does such a good job, but that clearly isn't the case with this newer async syntax - with a little bit of static analysis and inlining alone, Javascript can become far more performant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Take Away
&lt;/h2&gt;

&lt;p&gt;Currently in V8's implementation of Javascript, &lt;code&gt;async&lt;/code&gt; is just an abstraction of &lt;code&gt;Promise&lt;/code&gt;s, and &lt;code&gt;Promise&lt;/code&gt;s are just an abstraction of callbacks, and V8 doesn't appear to use the added information that an &lt;code&gt;async&lt;/code&gt; function provides over a traditional callback to make any sort of optimisations.&lt;/p&gt;
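&lt;p&gt;That layering can be shown as three equivalent ways of writing the same operation - V8 must preserve identical observable behaviour across all three, even though the &lt;code&gt;async&lt;/code&gt; form carries far more static information than a bare callback:&lt;/p&gt;

```javascript
// 1. Callback style: the result is delivered to a continuation
function addCallback(a, b, done) { done(a + b); }

// 2. Promise style: the callback wrapped in a Promise
function addPromise(a, b) {
  return new Promise(resolve => addCallback(a, b, resolve));
}

// 3. async/await style: syntactic sugar over the Promise
async function addAsync(a, b) {
  return await addPromise(a, b);
}
```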

&lt;p&gt;While the majority of active async Javascript code is probably IO-bound rather than CPU-bound, so this likely won't affect most Javascript code, your code can still be limited by these performance characteristics even when you're not the one doing the heavy CPU work. How you interface with a given library can give you massively different performance characteristics depending on whether or not you're using asynchronous code, and the problem can be exacerbated by the implementation details of the library.&lt;/p&gt;

&lt;h3&gt;
  
  
  What you can do now
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;As a general rule, try to avoid async where possible - and no, callbacks are not the solution, because they have the same performance impact.&lt;/li&gt;
&lt;li&gt;Where possible, instead of creating a new Promise bound by another, merge them into a single Promise.&lt;/li&gt;
&lt;/ol&gt;
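&lt;p&gt;One way to read rule 2: rather than allocating a new &lt;code&gt;Promise&lt;/code&gt; bound by an existing one, chain or return the original so only a single &lt;code&gt;Promise&lt;/code&gt; exists. A minimal sketch:&lt;/p&gt;

```javascript
// Wasteful: a second Promise wrapping, and bounded by, the first
function doubleWrapped(p) {
  return new Promise(resolve => { p.then(v => resolve(v * 2)); });
}

// Merged: .then() already returns a Promise, so reuse it
function merged(p) {
  return p.then(v => v * 2);
}
```

&lt;p&gt;Both resolve to the same value, but the merged form allocates one fewer &lt;code&gt;Promise&lt;/code&gt; and passes through one fewer microtask hop.&lt;/p&gt;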

</description>
      <category>javascript</category>
      <category>async</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
