<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ajani James Bilby</title>
    <description>The latest articles on DEV Community by Ajani James Bilby (@ajanibilby).</description>
    <link>https://dev.to/ajanibilby</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F176043%2F719f2286-ec1b-4730-8785-121be70b62ef.jpg</url>
      <title>DEV Community: Ajani James Bilby</title>
      <link>https://dev.to/ajanibilby</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ajanibilby"/>
    <language>en</language>
    <item>
      <title>The Upper Limits of WebAssembly Net Performance</title>
      <dc:creator>Ajani James Bilby</dc:creator>
      <pubDate>Thu, 14 Mar 2024 07:12:16 +0000</pubDate>
      <link>https://dev.to/ajanibilby/the-upper-limits-of-webassembly-performance-1j29</link>
      <guid>https://dev.to/ajanibilby/the-upper-limits-of-webassembly-performance-1j29</guid>
<description>&lt;p&gt;Wasmer.io recently released an article announcing their &lt;a href="https://wasmer.io/posts/winterjs-v1"&gt;Winter.js 1.0&lt;/a&gt;; however, the details of their &lt;a href="https://github.com/wasmerio/winterjs/tree/main/benchmark"&gt;benchmarks&lt;/a&gt; show that running Winter.js in wasm results in a 12x slowdown compared to native.&lt;/p&gt;

&lt;p&gt;A performance difference that large just doesn't sit right with me based on my experience with wasm. Yes, it should be slower, I would believe 4x, but 12x!?!?!? What is this, a PC port of a console game?&lt;/p&gt;

&lt;p&gt;Looking at the code bases of most current software with wasm support, if you get down to the underlying implementation it's a lot closer to a software port than a new target assembly.&lt;br&gt;
I would be willing to bet that over the coming years you will see all sorts of 2x, 4x, 10x articles about how software X massively improved its wasm performance once dev hours were put in because the technology became relevant.&lt;/p&gt;

&lt;p&gt;But I want to know the upper limit of wasm: what will those future performance gains be? I don't have a crystal ball, but I do have two smooth rocks in my back-yard and an IDE, so let's hand roll some web assembly and find out.&lt;/p&gt;
&lt;h2&gt;Banging Some Rocks&lt;/h2&gt;

&lt;p&gt;First of all we're going to need to bind some OS functions so we can talk to the OS's network layer. This is done via &lt;a href="https://wasix.org/"&gt;WASIX&lt;/a&gt;, a set of functions very similar to the POSIX standard.&lt;br&gt;
They get imported in a similar way to a DLL: you specify which functions you're trying to load and from where.&lt;/p&gt;

&lt;p&gt;But how do you know where to import these functions?&lt;br&gt;&lt;br&gt;
Idk bro, I just grepped the &lt;a href="https://github.com/wasix-org/wasix-libc"&gt;wasix-libc&lt;/a&gt; till I found something that looked right.&lt;/p&gt;

&lt;p&gt;There is another important bit to note: WebAssembly is designed for the web, and thus to be sent over the network, so the binaries are designed to be very small. There are a lot of features to reduce the number of bytes in a binary, such as &lt;a href="https://en.wikipedia.org/wiki/LEB128"&gt;LEB128&lt;/a&gt; integer encoding. More importantly for us, it means that function signatures are declared separately from function bodies so they can be reused. So you end up with something kind of cursed like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(type (;0;) (func (param i32) (result i32)))
(type (;1;) (func (param i32 i32) (result i32)))
(type (;2;) (func (param i32 i32 i32) (result i32)))
(type (;3;) (func (param i32 i32 i32 i32) (result i32)))
(type (;4;) (func (param i32 i32 i32 i32 i32) (result i32)))
(type (;5;) (func))
(type (;6;) (func (param i32)))
(type (;7;) (func (param i32 i32)))
(import "wasix_32v1" "fd_write" (func (;0;) (type 3)))
(import "wasix_32v1" "fd_close" (func (;1;) (type 0)))
(import "wasix_32v1" "sock_open"      (func (;2;) (type 3)))
(import "wasix_32v1" "sock_bind"      (func (;3;) (type 1)))
(import "wasix_32v1" "sock_listen"    (func (;4;) (type 1)))
(import "wasix_32v1" "sock_accept_v2" (func (;5;) (type 3)))
(import "wasix_32v1" "sock_send"      (func (;6;) (type 4)))
(import "wasix_32v1" "sock_status"    (func (;7;) (type 1)))
(import "wasix_32v1" "proc_exit"      (func (;8;) (type 6)))
(import "wasix_32v1" "sock_shutdown"  (func (;9;) (type 1)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
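As a rough illustration of how LEB128 keeps those bytes down, here is a sketch of the unsigned variant in Python (Python purely for readability; this encoder is illustrative and not taken from any toolchain):

```python
def uleb128(n):
    """Encode a non-negative integer as unsigned LEB128 bytes."""
    out = bytearray()
    while True:
        byte = n % 128   # take the low 7 bits of the value
        n //= 128        # shift the remaining bits down
        if n:
            out.append(byte + 128)  # set the continuation bit: more bytes follow
        else:
            out.append(byte)        # final byte: continuation bit clear
            return bytes(out)

# The classic worked example: 624485 encodes to the three bytes E5 8E 26
encoded = uleb128(624485)
```

Small values fit in a single byte, which is why wasm immediates and indices stay so compact on the wire.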



&lt;p&gt;Now that we have an outline of all of the functions we're going to use, let's quickly map out the lifetime of our program.&lt;br&gt;
We're really just trying to test the performance of the WASIX network stack, so doing anything too funky with multithreading or advanced algorithms would be more a test of current runtimes' multithreading implementations than of the root performance costs that might never be removable from the networking interface.&lt;/p&gt;

&lt;p&gt;So we want a really hot single-threaded loop. That means blocking, but we only want to block at times our CPU couldn't be doing something else anyway.&lt;br&gt;
We also literally don't care what the incoming request says, because we're testing raw TCP throughput.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti3kzhbqpoa4gjrea7ey.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fti3kzhbqpoa4gjrea7ey.png" alt="Image description" width="458" height="575"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our whole program is pretty much setup, followed by a loop of a handful of calls that can go around blazingly fast.&lt;/p&gt;

&lt;p&gt;First of all, let's get our data out of the way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We're always listening on the same port for incoming requests&lt;/li&gt;
&lt;li&gt;We're always replying with the same message&lt;/li&gt;
&lt;li&gt;We're only responding to one request at a time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means all of this memory can be predefined at compile time to be reused.&lt;/p&gt;



&lt;p&gt;So we'll make our struct defining what we're listening for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(data (i32.const 48) "\01\00")                   ;; sin_family: AF_INET = 0x0001
(data (i32.const 50) "\90\1f")                   ;; sin_port:      8080 = 0x1F90
(data (i32.const 52) "\00\00\00\00")             ;; sin_addr:INADDR_ANY = 0.0.0.0
(data (i32.const 56) "\00\00\00\00\00\00\00\00") ;; sin_zero = char[8] padding for sockaddr compatibility
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
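Those four data segments can be sanity-checked by building the same 16-byte structure in Python. Note the fields here are little-endian, matching the wat comments above (classic BSD sockets would instead store the port in network byte order); this is just an illustrative byte-layout check:

```python
family = (1).to_bytes(2, "little")      # sin_family: AF_INET = 0x0001
port = (8080).to_bytes(2, "little")     # sin_port: 8080 = 0x1F90
addr = (0).to_bytes(4, "little")        # sin_addr: INADDR_ANY = 0.0.0.0
zero = bytes(8)                         # sin_zero: char[8] padding
sockaddr = family + port + addr + zero  # 16 bytes total, matching offsets 48..63
```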



&lt;p&gt;Now we'll craft our output response - this is encoded with an &lt;code&gt;iovec&lt;/code&gt;, which is basically just two &lt;code&gt;i32&lt;/code&gt; integers slapped together: the first is a pointer to the start of the buffer, and the second is the length of the buffer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(data (i32.const 80) "\58\00\00\00\24\00\00\00")
(data (i32.const 88) "HTTP/1.1 200 OK\0d\0a\0d\0aHello, World!")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
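Decoding that first segment: the pointer is 88 (0x58, where the response text starts) and the length is 36 (0x24 bytes). A quick Python check of the byte layout:

```python
buf_ptr = (88).to_bytes(4, "little")  # i32 pointer to the response buffer (0x58)
buf_len = (36).to_bytes(4, "little")  # i32 buffer length in bytes (0x24)
iovec = buf_ptr + buf_len             # the 8-byte iovec stored at offset 80
```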



&lt;p&gt;When we get an incoming request we need a place to store its details, so we can tell the OS which request we're responding to.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(data (i32.const 160) "\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00\00")

;; stack: offset.255
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Imports done, global variables done, now we just need the actual code.&lt;/p&gt;




&lt;p&gt;First we need to define our function and the fact it has two local variables; these will be used to store the file descriptor of the socket we have open, and the file descriptor of an incoming request's socket.&lt;br&gt;
These local variables are a lot closer to user-defined registers than actual variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(func (;10;) (type 5) (local i32 i32)
  ;; ...
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we create a new OS socket, specifying it's an IPv4 socket (&lt;code&gt;AF_INET&lt;/code&gt;) using TCP (&lt;code&gt;SOCK_STREAM&lt;/code&gt;). Since the early specification of wasm doesn't allow multiple return values, the return value from this first call is an error code - but we don't care about that.&lt;br&gt;
We give it the pointer &lt;code&gt;255&lt;/code&gt;, a region that won't interfere with our global data; after a successful call the file descriptor will be written there, and we then load it into a local variable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;;; Create socket using sock_open
i32.const 1    ;; AF_INET
i32.const 1    ;; SOCK_STREAM
i32.const 0    ;; Protocol
i32.const 255  ;; result pointer
call 2         ;; sock_open()
drop           ;; we don't care about errors

;; Load the socket descriptor from memory
i32.const 255
i32.load
local.set 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next step: bind the socket to the &lt;code&gt;sockaddr_in&lt;/code&gt; we defined earlier in global memory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;;; Bind socket to address and port
local.get 0   ;; Socket file descriptor
i32.const 48  ;; Address of the sockaddr_in structure
call 3        ;; sock_bind()
drop          ;; if it's not going to error, hopefully
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tell the OS we're listening for requests, and queue up to 100 pending connections&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;;; Listen for incoming connections
local.get 0     ;; Socket file descriptor
i32.const 100   ;; Backlog (maximum pending connections)
call 4          ;; sock_listen()
drop            ;; it's just wasted cycles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now for the hot loop&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(loop
  local.get 0    ;; Listening socket file descriptor
  i32.const 0    ;; Desired file descriptor flags (default)
  i32.const 64   ;; result pointer: new socket
  i32.const 160  ;; result pointer: remote address
  call 5         ;; sock_accept_v2()
  drop           ;; we only accept winners in these parts

  ;; Load the new socket descriptor from memory
  i32.const 64
  i32.load
  local.set 1

  ;; Send response to the client
  local.get 1    ;; socket
  i32.const 80   ;; iovs
  i32.const 1    ;; iovs_len
  i32.const 0    ;; No additional flags
  i32.const 160  ;; result pointer: bytes sent
  call 6         ;; sock_send()
  drop           ;; get dropped

  ;; Shutdown the socket
  local.get 1 ;; socket
  i32.const 2 ;; how: SHUT_RDWR
  call 9      ;; sock_shutdown()
  drop        ;; we're done here

  ;; Close the fd
  local.get 1 ;; socket
  call 1      ;; fd_close()
  drop        ;; bye

  br 0
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;Testing Methodology&lt;/h2&gt;

&lt;p&gt;For testing we want something directly comparable with Winter.js's benchmark, so we used &lt;a href="https://github.com/wg/wrk"&gt;wrk&lt;/a&gt;, which is made for linux systems.&lt;br&gt;&lt;br&gt;
So to a linux system we shall go.&lt;/p&gt;

&lt;p&gt;Dual booting with modern Windows + secure boot + TPMs makes life painful, so my system doesn't run native linux.&lt;br&gt;&lt;br&gt;
A VPS could have noisy neighbours which would skew results, so we can't use one of those.&lt;br&gt;&lt;br&gt;
I had issues compiling some of this stuff for ARM, so the Raspberry Pi 4 is out the window.&lt;br&gt;&lt;br&gt;
So I used WSL, which definitely hurt performance - but it hurts everyone's performance equally, so that's okay.&lt;/p&gt;

&lt;p&gt;I ran the &lt;a href="https://www.ajanibilby.com/blog/the-upper-limit-of-wasm-performance/appendix#benchmark-source"&gt;&lt;code&gt;server.wat&lt;/code&gt;&lt;/a&gt; through &lt;code&gt;wasmer&lt;/code&gt; directly, as well as using &lt;code&gt;wasmer create-exe --llvm&lt;/code&gt; to get the highest WebAssembly performance possible.&lt;br&gt;
However, giving Winter.js's wasm port the same treatment caused a compilation error I'd need a full-time job to debug.&lt;/p&gt;

&lt;p&gt;I rewrote the &lt;code&gt;server.wat&lt;/code&gt; in C to make &lt;a href="https://www.ajanibilby.com/blog/the-upper-limit-of-wasm-performance/appendix#benchmark-source"&gt;&lt;code&gt;server.c&lt;/code&gt;&lt;/a&gt; as an apples to apples native comparison.&lt;/p&gt;

&lt;p&gt;I also ran Winter.js's NodeJS and Bun benchmarks to have a shared point of reference.&lt;/p&gt;

&lt;p&gt;For each test I ran it three times, taking the median &lt;code&gt;req/sec avg&lt;/code&gt; value for the graph below.&lt;/p&gt;
&lt;h2&gt;Results&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wrk &lt;span class="nt"&gt;-t12&lt;/span&gt; &lt;span class="nt"&gt;-c400&lt;/span&gt; &lt;span class="nt"&gt;-d10s&lt;/span&gt; http://127.0.0.1:8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iX3fcfII--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.ajanibilby.com/blog/the-upper-limit-of-wasm-performance/chart.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iX3fcfII--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.ajanibilby.com/blog/the-upper-limit-of-wasm-performance/chart.jpg" alt="Chart" width="758" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;*I was unable to get Winter.js to compile, so the value on this graph is an estimate based on its relative performance to &lt;code&gt;Bun&lt;/code&gt;, &lt;code&gt;Node&lt;/code&gt; and &lt;code&gt;Winter.js (WASIX)&lt;/code&gt;. For exact details you can see the spreadsheet &lt;a href="//./chart.xlsx"&gt;here&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Initially looking at these results, the Bun and Winter.js numbers seem super sus if we assume they were single threaded, since the underlying Javascript should be executing on a single thread (this is also why I didn't test Go).&lt;/p&gt;

&lt;p&gt;If we think about our hot loop flow path, at what times are we waiting when we could be executing the response (limited to a single thread)?&lt;br&gt;&lt;br&gt;
The only time the accept blocks is when there are no pending requests, because we're waiting for the next request.&lt;br&gt;&lt;br&gt;
And when there is a request waiting, the function should return instantly.&lt;/p&gt;

&lt;p&gt;When we send data down the socket, there is no waiting there either, because the OS doesn't wait for confirmation that a packet was received before sending the next one.&lt;br&gt;&lt;br&gt;
Shutting down the socket and closing the file descriptor triggers OS-level cleanup of those resources, which also shouldn't cause a wait.&lt;br&gt;&lt;br&gt;
Assuming all of these are handled well by the OS, there shouldn't be much waiting - we're only sending and receiving &lt;strong&gt;tens&lt;/strong&gt; of bytes.&lt;/p&gt;
&lt;h2&gt;So Let's Look at the Syscalls&lt;/h2&gt;

&lt;p&gt;So I ran the &lt;code&gt;server.wat&lt;/code&gt; again, this time with tracing, and then manually removed the first couple of lines so the log starts when it listens for its second-ever request,&lt;br&gt;
since the first request has a long blocking period because I hadn't started the &lt;code&gt;wrk&lt;/code&gt; command yet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;RUST_LOG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;wasmer_wasix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;trace wasmer run server.wat &lt;span class="nt"&gt;--net&lt;/span&gt; &amp;amp;&amp;gt; log.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sock_accept_v2: wasmer_wasix::syscalls::wasix::sock_accept: return=Ok(Errno::success) sock=6 fd=9614
sock_accept_v2: wasmer_wasix::syscalls::wasix::sock_accept: close time.busy=137µs time.idle=1.01µs sock=6 fd=9614
sock_send: wasmer_wasix::syscalls::wasix::sock_send: bytes_written=36 fd=9614
sock_send: wasmer_wasix::syscalls::wasix::sock_send: return=Ok(Errno::success) fd=9614 nsent=36
sock_send: wasmer_wasix::syscalls::wasix::sock_send: close time.busy=224µs time.idle=842ns fd=9614 nsent=36
sock_shutdown: wasmer_wasix::syscalls::wasix::sock_shutdown: return=Ok(Errno::notconn) sock=9614
sock_shutdown: wasmer_wasix::syscalls::wasix::sock_shutdown: close time.busy=91.6µs time.idle=781ns sock=9614
fd_close: wasmer_wasix::fs: closing file descriptor fd=9614 inode=9615 ref_cnt=1 pid=1 fd=9614
fd_close: wasmer_wasix::syscalls::wasi::fd_close: return=Ok(Errno::success) pid=1 fd=9614
fd_close: wasmer_wasix::syscalls::wasi::fd_close: close time.busy=191µs time.idle=852ns pid=1 fd=9614
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we'll do a little bit of python to get the aggregate values. Since our hot loop is really tight, it doesn't matter whether we sum or average the &lt;code&gt;time&lt;/code&gt;s per call, because each call is made exactly once per iteration.&lt;/p&gt;
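A sketch of that aggregation step (illustrative only; the regex assumes the trace-line shape shown above, keying on each syscall's closing time.busy= field and normalising ns to µs):

```python
import re
from collections import defaultdict

# Matches lines like:
#   sock_send: wasmer_wasix::syscalls::wasix::sock_send: close time.busy=224µs time.idle=842ns fd=9614
LINE = re.compile(r"^(\w+): .*close time\.busy=([\d.]+)(µs|ns)")

def aggregate(log):
    busy = defaultdict(float)  # syscall name -> total busy time in microseconds
    for line in log.splitlines():
        m = LINE.match(line)
        if m:
            scale = 1.0 if m.group(3) == "µs" else 0.001  # normalise ns to µs
            busy[m.group(1)] += float(m.group(2)) * scale
    return dict(busy)
```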

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bISiNGhz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.ajanibilby.com/blog/the-upper-limit-of-wasm-performance/sys-call-graph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bISiNGhz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://www.ajanibilby.com/blog/the-upper-limit-of-wasm-performance/sys-call-graph.png" alt="syscall graph" width="800" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this it's obvious that our assumption was partly correct: &lt;code&gt;sock_shutdown&lt;/code&gt; does basically nothing, same with &lt;code&gt;sock_accept_v2&lt;/code&gt; since we have constant incoming requests. But there are two big problems: &lt;code&gt;fd_close&lt;/code&gt; and &lt;code&gt;sock_send&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;fd_close&lt;/code&gt; presumably runs all of the necessary OS cleanup on the file descriptor then and there before switching context back to our app, and this is likely also true for &lt;code&gt;sock_send&lt;/code&gt;, since in comparison to most system calls they're very cheap.&lt;br&gt;&lt;br&gt;
The problem is that since we're only making cheap calls, to us they're quite expensive - and this is where Winter.js and Bun can pull ahead.&lt;/p&gt;

&lt;p&gt;Depending on what mechanism you use to communicate between threads in a program, it can be cheaper than a system call. Hence if, instead of doing the expensive &lt;code&gt;sock_send&lt;/code&gt;, &lt;code&gt;sock_shutdown&lt;/code&gt; and &lt;code&gt;fd_close&lt;/code&gt; on our main thread, we just throw them over to a secondary worker thread to do our dirty laundry, we could actually see measurable performance increases. This is likely the main reason why Winter.js and Bun can pull ahead - they're probably both doing this.&lt;/p&gt;

&lt;p&gt;This is also likely the reason why Winter.js in wasm is super slow: the multithreading model in WebAssembly might not be highly optimised, so the communication between threads could end up being more costly than just running the system call.&lt;br&gt;
This would give us exactly the results we saw in our first graph.&lt;/p&gt;

&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;Just like I said in the beginning, there is a big chance that current WebAssembly performance will increase at the programming-language level; I think there is still room for improvement based on these graphs.&lt;br&gt;
WebAssembly didn't start with a multithreading specification; it was added later and is still a flag you have to enable on some runtimes, so it makes sense that it might not be well optimised yet.&lt;br&gt;
This is likely compounded by the fact that probably no programming language is using the existing multithreading systems properly yet, so the optimisation focus is on the languages rather than the runtimes.&lt;/p&gt;

&lt;p&gt;I don't think WebAssembly will ever reach the performance of native, but that's not the point; all it needs to be is on par with the performance of current virtualisation platforms.&lt;br&gt;
Based on the fact that we can already touch Node performance, the currently available runtimes are suitable for a lot of current server workloads - the question is whether it can get to the point where it's applicable to all server workloads,&lt;br&gt;
where you can just push your complete server bundled as a single wasm binary, specify a few environment variables, and let the data centres handle it from there.&lt;/p&gt;




&lt;p&gt;Source code for benchmarks and raw results can be found in the &lt;a href="https://www.ajanibilby.com/blog/the-upper-limit-of-wasm-performance/appendix"&gt;Appendix&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webassembly</category>
      <category>performance</category>
      <category>http</category>
    </item>
    <item>
      <title>Wasm is not going to save Javascript</title>
      <dc:creator>Ajani James Bilby</dc:creator>
      <pubDate>Thu, 20 Jul 2023 05:34:01 +0000</pubDate>
      <link>https://dev.to/ajanibilby/wasm-is-not-going-to-save-javascript-2gp0</link>
      <guid>https://dev.to/ajanibilby/wasm-is-not-going-to-save-javascript-2gp0</guid>
<description>&lt;p&gt;This article is a case study of the performance impact of improving the &lt;a href="https://bnf-parser.ajanibilby.com/"&gt;bnf-parser&lt;/a&gt; library to be able to take a given &lt;a href="https://en.wikipedia.org/wiki/Backus%E2%80%93Naur_form"&gt;BNF syntax&lt;/a&gt; input and compile it all the way down to an optimised parser in wasm for execution - improving parse times of arbitrary syntaxes.&lt;/p&gt;

&lt;h2&gt;Testing Methodology&lt;/h2&gt;

&lt;p&gt;For each round of testing we will parse two rather complex BNFs (&lt;a href="https://github.com/AjaniBilby/BNF-parser/blob/main/test/bnf/sequalize.bnf"&gt;sequalize&lt;/a&gt;, &lt;a href="https://github.com/AjaniBilby/BNF-parser/blob/main/test/bnf/lolcode.bnf"&gt;lolcode&lt;/a&gt;) via three different parsing methods sequentially, measuring the total parse time for each method using perf hooks. The library itself is actually bootstrapped (it compiles and uses itself), and the first stage of compiling a BNF syntax is parsing it - so this is a valid test case for potential parsers generated by the library.&lt;/p&gt;

&lt;p&gt;All three parse methods are tested per round to keep the execution of each method closely coupled to each other to mitigate the impacts of background processes, and V8 optimisations so that these external factors will hopefully affect each of the parsers similarly.&lt;/p&gt;

&lt;p&gt;The first two parsers are actually the same wasm-compiled parser in two different modes: the first with source mapping enabled, and the second without. Source mapping is an optional extra pass which can be applied to the wasm parser, correctly mapping syntax nodes of the tree to the &lt;code&gt;column&lt;/code&gt;, &lt;code&gt;row&lt;/code&gt;, and &lt;strong&gt;javascript&lt;/strong&gt; &lt;code&gt;index&lt;/code&gt; (don't get me started on UTF-16) they span. This is optional in &lt;a href="https://bnf-parser.ajanibilby.com/"&gt;bnf-parser&lt;/a&gt; because it allows the parser to not waste time allocating extra reference objects which aren't necessary for applications that don't need syntax error messages.&lt;/p&gt;

&lt;p&gt;These compiled parsers have also had the default optimisations applied within &lt;a href="https://www.npmjs.com/package/binaryen"&gt;binaryen&lt;/a&gt; which should hopefully give them an advantage over the Javascript implementation (assuming V8 optimisations don't have their way).&lt;/p&gt;

&lt;p&gt;The third parser is &lt;a href="https://bnf-parser.ajanibilby.com/"&gt;bnf-parser&lt;/a&gt;'s legacy parser, which behaves kind of like a graph traversal completely in Javascript: the graph structure is generated from a BNF, and the resulting syntax tree for a given input is generated based on this graph traversal (like a &lt;a href="https://en.wikipedia.org/wiki/Deterministic_finite_automaton"&gt;DFA&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;We ran the testing round &lt;code&gt;10,000&lt;/code&gt; consecutive times, gathering results using NodeJS &lt;a href="https://nodejs.org/api/perf_hooks.html"&gt;perf_hooks&lt;/a&gt;. We then also ran the tests a second time with more in-depth hooks put into the &lt;a href="https://bnf-parser.ajanibilby.com/artifact/"&gt;bnf-parser artifacts&lt;/a&gt; to see exactly what's going on under the hood. These in-depth measurements were not taken in the tests comparing the different parsers, as the act of measuring them would heavily impact the overall performance of the wasm results - they're just that fast.&lt;/p&gt;

&lt;h2&gt;Results&lt;/h2&gt;

&lt;p&gt;From the results below we can see a few interesting trends: both the &lt;code&gt;wasm w/ mapping&lt;/code&gt; parser and the &lt;code&gt;legacy&lt;/code&gt; parser have significantly higher 99% execution times than their 1%. This is because both of them are receiving a lot of love from V8's excellent optimiser. For the first couple of runs it's slow, but once the JIT realises it's doing the same thing many times it starts to optimise.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We also did a test where after each testing round we attempted to parse another random non-BNF syntax, to see if it threw off V8's optimisations due to the graph traversal functions now running on a different graph to the one they were optimised for. However, that had no noticeable effect.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Wasm &lt;code&gt;w/ map&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;Wasm &lt;code&gt;no map&lt;/code&gt;
&lt;/th&gt;
&lt;th&gt;Legacy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Max&lt;/td&gt;
&lt;td&gt;6.1372ms&lt;/td&gt;
&lt;td&gt;2.5623ms&lt;/td&gt;
&lt;td&gt;13.5988ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99%&lt;/td&gt;
&lt;td&gt;2.3481ms&lt;/td&gt;
&lt;td&gt;0.8897ms&lt;/td&gt;
&lt;td&gt;2.2354ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;50%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.5971ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.6350ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.6305ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;td&gt;0.6533ms&lt;/td&gt;
&lt;td&gt;0.2202ms&lt;/td&gt;
&lt;td&gt;0.4673ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min&lt;/td&gt;
&lt;td&gt;0.6437ms&lt;/td&gt;
&lt;td&gt;0.2173ms&lt;/td&gt;
&lt;td&gt;0.4602ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean&lt;/td&gt;
&lt;td&gt;1.2774ms&lt;/td&gt;
&lt;td&gt;0.5384ms&lt;/td&gt;
&lt;td&gt;1.2102ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Comparing the median &lt;code&gt;legacy&lt;/code&gt; times to &lt;code&gt;wasm no map&lt;/code&gt; we can see an approximate &lt;code&gt;2.56x&lt;/code&gt; improvement - however only &lt;code&gt;legacy&lt;/code&gt; is generating &lt;code&gt;SyntaxNode&lt;/code&gt; references, so it's not a fair comparison of equivalent compute. Comparing the &lt;code&gt;wasm&lt;/code&gt; parser with source mapping to &lt;code&gt;legacy&lt;/code&gt; we see only a &lt;code&gt;1.04x&lt;/code&gt; improvement.&lt;/p&gt;

&lt;p&gt;And that might get you thinking - wow, JS really isn't that slow. But comparing &lt;code&gt;wasm&lt;/code&gt; to raw JS performance isn't fair either, because you're missing a step. You need to move data in and out of the JS world to the &lt;code&gt;wasm&lt;/code&gt; instance, and that has a tax.&lt;/p&gt;

&lt;h3&gt;The transport tax&lt;/h3&gt;

&lt;p&gt;There are four main stages of using the &lt;code&gt;wasm&lt;/code&gt; &lt;a href="https://bnf-parser.ajanibilby.com/"&gt;bnf-parser&lt;/a&gt;;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Encoding&lt;/strong&gt;: This is where you take the input data from Javascript, and write it into the &lt;code&gt;wasm&lt;/code&gt; instance's memory, and also tell the instance how long the data you just put in is&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parsing&lt;/strong&gt;: This is the actual work we want to complete, this is iterating over the string and generating the entire syntax tree&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decoding&lt;/strong&gt;: We want to be able to use that tree in JS, so we need to load it back out to be useful - bring it back over to JS land.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mapping&lt;/strong&gt;: This is generating the source references for a given syntax tree, based on the input data.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;It's important to note that the &lt;code&gt;mapping&lt;/code&gt; part is almost entirely done in Javascript rather than in &lt;code&gt;wasm&lt;/code&gt;, because the computation is super simple: you're just iterating forward over a string, counting the index, as you depth-first traverse the syntax tree filling in the references (&lt;em&gt;this is done using stack operations, so it's a single function call to save on the extra tax of recursive calls in JS&lt;/em&gt;).&lt;br&gt;&lt;br&gt;
Since the majority of the complex work is simply allocating new objects to store the reference at each point, there would be no real time saved by doing this in wasm, and any time saved would be mostly lost to the data-transfer tax.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Encode&lt;/th&gt;
&lt;th&gt;Parse&lt;/th&gt;
&lt;th&gt;Decode&lt;/th&gt;
&lt;th&gt;Mapping&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Max&lt;/td&gt;
&lt;td&gt;0.2647ms&lt;/td&gt;
&lt;td&gt;0.2692ms&lt;/td&gt;
&lt;td&gt;2.6253ms&lt;/td&gt;
&lt;td&gt;3.5991ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99%&lt;/td&gt;
&lt;td&gt;0.0064ms&lt;/td&gt;
&lt;td&gt;0.1443ms&lt;/td&gt;
&lt;td&gt;1.0704ms&lt;/td&gt;
&lt;td&gt;1.1919ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;50%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.0026ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.1335ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.6914ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9063ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;td&gt;0.0023ms&lt;/td&gt;
&lt;td&gt;0.0436ms&lt;/td&gt;
&lt;td&gt;0.1738ms&lt;/td&gt;
&lt;td&gt;0.4232ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min&lt;/td&gt;
&lt;td&gt;0.0020ms&lt;/td&gt;
&lt;td&gt;0.0428ms&lt;/td&gt;
&lt;td&gt;0.1720ms&lt;/td&gt;
&lt;td&gt;0.4160ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mean&lt;/td&gt;
&lt;td&gt;0.0032ms&lt;/td&gt;
&lt;td&gt;0.0910ms&lt;/td&gt;
&lt;td&gt;0.5242ms&lt;/td&gt;
&lt;td&gt;0.7071ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;From this data we can see we are spending &lt;code&gt;40.03%&lt;/code&gt; of our time just moving data between JS and WASM land - that's almost half of the entire computation.&lt;br&gt;
We can also see that another &lt;code&gt;55.41%&lt;/code&gt; is taken up by generating the source references.&lt;br&gt;&lt;br&gt;
That leaves only &lt;code&gt;7.70%&lt;/code&gt; of the total runtime actually computing the syntax tree!&lt;/p&gt;

&lt;h2&gt;
  
  
  What's up with Javascript's GST rates being so high?
&lt;/h2&gt;

&lt;p&gt;The reason this tax for transferring data is so high is painfully illustrated by the difference in compute time between &lt;code&gt;decode&lt;/code&gt; and &lt;code&gt;mapping&lt;/code&gt;. Mapping is much simpler than decoding - which has to traverse a tree generated by a different language, worry about bit alignment, and actually decode the foreign data into something Javascript can use - yet mapping takes longer.&lt;/p&gt;

&lt;p&gt;The reason is object allocation.&lt;br&gt;&lt;br&gt;
Everything in Javascript is an object, even the object within an object is an object - and that's objectively a problem.&lt;/p&gt;

&lt;p&gt;In any statically typed language, if you allocate a &lt;code&gt;struct&lt;/code&gt; which has another &lt;code&gt;struct&lt;/code&gt; as its member, you get that child for free. That isn't the case in Javascript. Every &lt;code&gt;SyntaxNode&lt;/code&gt; has a &lt;code&gt;ReferenceRange&lt;/code&gt; which contains two &lt;code&gt;Reference&lt;/code&gt; objects - so if you want to allocate a &lt;code&gt;SyntaxNode&lt;/code&gt; and fill in all of its children, that's actually &lt;code&gt;4&lt;/code&gt; allocations, not &lt;code&gt;1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The reason decoding is able to be as fast as it is, is object reuse. By default every single &lt;code&gt;SyntaxNode&lt;/code&gt; shares the same &lt;code&gt;ReferenceRange&lt;/code&gt; instance - that range and its two children only need to be allocated once, yet every &lt;code&gt;SyntaxNode&lt;/code&gt; still gets a &lt;code&gt;ReferenceRange&lt;/code&gt;, so you don't need null checks everywhere - and we only pay one allocation per node.&lt;/p&gt;

&lt;p&gt;But when you run the source map over the syntax tree, now for every single &lt;code&gt;SyntaxNode&lt;/code&gt; you have to perform &lt;code&gt;3&lt;/code&gt; allocations: &lt;code&gt;ReferenceRange&lt;/code&gt;, start &lt;code&gt;Reference&lt;/code&gt;, and end &lt;code&gt;Reference&lt;/code&gt;.&lt;/p&gt;
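&lt;p&gt;A minimal sketch of this reuse trick (the class names are assumptions based on the description above, not bnf-parser's exact code):&lt;/p&gt;

```typescript
// Illustrative only: shows why decode pays 1 allocation per node
// while mapping pays 3 extra.
class Reference {
  constructor(public index: number) {}
}
class ReferenceRange {
  constructor(public start: Reference, public end: Reference) {}
}
class SyntaxNode {
  constructor(public type: string, public ref: ReferenceRange) {}
}

// One shared blank range - allocated once, ever
const BLANK_RANGE = new ReferenceRange(new Reference(0), new Reference(0));

// Decode: every node points at the shared range, so no null checks
// are needed anywhere, and we only allocate the node itself
function decodeNode(type: string): SyntaxNode {
  return new SyntaxNode(type, BLANK_RANGE);
}

// Mapping: now every node needs its own range - 3 allocations per node
function mapNode(node: SyntaxNode, start: number, end: number): void {
  node.ref = new ReferenceRange(new Reference(start), new Reference(end));
}
```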

&lt;p&gt;Part of the reason the execution in &lt;code&gt;wasm&lt;/code&gt; is so fast is that it only does one allocation: the entire tree itself.&lt;br&gt;
The tree in &lt;code&gt;wasm&lt;/code&gt; is flat-packed into linear memory. Since the data is read out after every parse, we don't need the previous tree once a new parse starts - so we can just write over it, reusing the same single allocation for zero allocations per parse. In other languages like &lt;code&gt;C++&lt;/code&gt; you could allocate a vector a factor or two larger than your estimated tree size, compute your flat tree, then shrink it afterwards. Two allocations.&lt;/p&gt;
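&lt;p&gt;The flat-packed, overwrite-in-place idea can be sketched in a few lines (an &lt;code&gt;Int32Array&lt;/code&gt; standing in for linear memory; the 3-slot node layout is an assumption for illustration):&lt;/p&gt;

```typescript
// Sketch of a flat-packed syntax tree that reuses one buffer across parses.
// Each node occupies 3 slots: [typeId, childCount, byteLength]
class FlatTree {
  private buffer = new Int32Array(1024); // single allocation, reused every parse
  private top = 0;

  reset(): void {
    this.top = 0; // just write over the previous tree
  }

  push(typeId: number, childCount: number, byteLength: number): number {
    if (this.top + 3 > this.buffer.length) {
      // Grow rarely, like the oversized C++ vector idea
      const next = new Int32Array(this.buffer.length * 2);
      next.set(this.buffer);
      this.buffer = next;
    }
    const at = this.top;
    this.buffer[at] = typeId;
    this.buffer[at + 1] = childCount;
    this.buffer[at + 2] = byteLength;
    this.top += 3;
    return at;
  }

  read(at: number): { typeId: number; childCount: number; byteLength: number } {
    return {
      typeId: this.buffer[at],
      childCount: this.buffer[at + 1],
      byteLength: this.buffer[at + 2],
    };
  }
}
```

&lt;p&gt;Each parse just calls &lt;code&gt;reset()&lt;/code&gt; and writes over the old tree - the buffer is allocated once and reused.&lt;/p&gt;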

&lt;p&gt;In Javascript everything is an object, everything must be &lt;strong&gt;independently&lt;/strong&gt; allocated.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can Wasm still work as an accelerator?
&lt;/h2&gt;

&lt;p&gt;Wasm libraries can still work as accelerators for Javascript, in almost an identical way to how everything in Python is actually a C library. You could have a library for matrix multiplication where all of your matrices are permanently stored in WASM, only coming out after computation is complete to be printed, sent over the network, or written to file.&lt;/p&gt;
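&lt;p&gt;That pattern looks something like this: JS only ever holds opaque handles, and the values themselves never cross the boundary until you ask for them. (A plain object and a &lt;code&gt;Map&lt;/code&gt; stand in for the wasm module and its linear memory here; all the names are made up for illustration.)&lt;/p&gt;

```typescript
// Sketch of the "data stays in wasm" pattern: JS holds handles,
// the matrices live on the other side of the boundary.
type Handle = number;

const matrices = new Map<Handle, Float64Array>(); // stand-in for linear memory
let nextHandle: Handle = 1;

const fakeWasm = {
  // Create an n*n matrix filled with a value, return only a handle
  create(n: number, fill: number): Handle {
    const h = nextHandle++;
    matrices.set(h, new Float64Array(n * n).fill(fill));
    return h;
  },
  // The result also stays "inside wasm" - JS never sees the data
  add(a: Handle, b: Handle): Handle {
    const A = matrices.get(a)!;
    const B = matrices.get(b)!;
    const h = this.create(Math.sqrt(A.length), 0);
    const C = matrices.get(h)!;
    for (let i = 0; i < A.length; i++) C[i] = A[i] + B[i];
    return h;
  },
  // Data only crosses the boundary when we actually need it out
  readCell(h: Handle, i: number): number {
    return matrices.get(h)![i];
  },
};
```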

&lt;p&gt;So much like the current Python ecosystem, JS could head towards a world where it's a glue language - the problem is that typical Javascript workloads are &lt;code&gt;90%&lt;/code&gt; glue.&lt;/p&gt;

&lt;p&gt;For the vast majority of Javascript execution it's:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Take something from the network&lt;/li&gt;
&lt;li&gt;Perform a small amount of manipulation&lt;/li&gt;
&lt;li&gt;Send it out to the network, or write to the DOM.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;JS is primarily a middle-man language. Attempting to use it like Python - abstracting the middle man's duty to another person, creating another middle man - leads to very little performance gain and a whole lot of headache. Try talking to the tech support of any major tech company and you'll see what I mean.&lt;/p&gt;

&lt;p&gt;Wasm also has another headache. It's security focused, meaning every wasm instance has its own independent memory. That means two different wasm libraries can't actually share memory unless they are recompiled together, or you pass data between wasm libraries much like you do from JS to wasm.&lt;/p&gt;

&lt;p&gt;Plus the majority of wasm-compiled modules don't actually play nicely together; they're compiled to rule their own sandpit, and no one else can enter. If you attempted to bring a C++ library and a Rust library into the same WASM module, whose &lt;code&gt;malloc&lt;/code&gt; implementation are we using? There is only one linear memory, and they can't both be operating on the same space.&lt;br&gt;
Whose are we choosing? How do we choose? How does each of the children know who was chosen?&lt;/p&gt;

&lt;h2&gt;
  
  
  Wasm is what people wanted docker to be
&lt;/h2&gt;

&lt;p&gt;Wasm is a really powerful tool, but I think people are misunderstanding where it's heading and what it will be great for.&lt;br&gt;
No, it will not be great for bringing that one Rust library into your TS workflow.&lt;br&gt;
It's better to think of it like a super lightweight and actually portable docker container that can execute anywhere.&lt;/p&gt;

&lt;p&gt;You can bring that container into the browser to act as your front end, or you can have it running as a micro-service, or as the entire backend.&lt;br&gt;
What it's not is a way to use language &lt;code&gt;X&lt;/code&gt;'s library in language &lt;code&gt;Y&lt;/code&gt;.&lt;/p&gt;

</description>
      <category>webassembly</category>
      <category>performance</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Async functions are needlessly killing your Javascript performance</title>
      <dc:creator>Ajani James Bilby</dc:creator>
      <pubDate>Sat, 01 Apr 2023 13:00:00 +0000</pubDate>
      <link>https://dev.to/ajanibilby/async-functions-are-needlessly-killing-your-javascript-performance-20g2</link>
      <guid>https://dev.to/ajanibilby/async-functions-are-needlessly-killing-your-javascript-performance-20g2</guid>
      <description>&lt;p&gt;While numerous articles offer quick tips to enhance your async JavaScript performance using extra Promise API features, this discussion focuses on how program structure tweaks can lead to significant speed improvements. By optimizing your code, you could potentially achieve 1.9x or even 14x speed boosts.&lt;/p&gt;

&lt;p&gt;I believe the untapped performance potential in asynchronous JavaScript features is due to the V8 engine not providing the expected level of optimization for these features, and there are a few key indicators that suggest this possibility.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context
&lt;/h2&gt;

&lt;p&gt;You can skip this section if you want, but here's a brief overview of the context. I've been working with a &lt;a href="https://www.npmjs.com/package/bnf-parser"&gt;bnf-parser&lt;/a&gt; library that currently needs a complete file to be loaded for parsing it into a BNF-specified syntax tree. However, the library could be refactored to use cloneable state generators, which output file characters sequentially and allow for copying at a specific point to resume reading later.&lt;/p&gt;

&lt;p&gt;So I tried implementing it in Javascript to be able to parse large (1GB+) files into partial syntax trees for processing large XML - partly for fun, and partly because I know I'll soon need to implement something similar in a lower-level language, so this could be good practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Case Study
&lt;/h2&gt;

&lt;p&gt;I aimed to create a layer over the readable data stream from disk that allows iteratively calling forward for small text portions with limited backtracking. I implemented a &lt;a href="https://github.com/AjaniBilby/BNF-parser/blob/350e9a00fc4ca06acc98245377fb705f00d286b8/source/lib/cache.ts#L7-L45"&gt;Cursor&lt;/a&gt; that iterates forward, returning the passed-over characters as a string. Cursors can be cloned, and clones can independently move forward. Importantly, cursors &lt;em&gt;may&lt;/em&gt; need to wait for data currently being streamed to become available before returning the next substring. To minimize memory usage, we discard unreachable data - implementing all of this in an async/await pattern to avoid complex callback chains or unnecessary event loop blocking.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Side note: We use pooling for caching, placing each chunk read from the disk into an array and manipulating the array to free cached data. This method reduces resize operations and string manipulation. However, it can cause NodeJS to report false memory usage, as chunks allocated by the OS are sometimes not counted until manipulated within the application domain.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The cursor features an async read call, asynchronously connecting to a &lt;a href="https://github.com/AjaniBilby/BNF-parser/blob/350e9a00fc4ca06acc98245377fb705f00d286b8/source/lib/cache.ts#L52-L271"&gt;StreamCache&lt;/a&gt; to read from the cache. Multiple cursors may attempt to read the latest unavailable information, requiring a &lt;a href="https://en.cppreference.com/w/cpp/thread/condition_variable"&gt;condition variable&lt;/a&gt;-style lock - an async call to a &lt;a href="https://github.com/AjaniBilby/BNF-parser/blob/350e9a00fc4ca06acc98245377fb705f00d286b8/source/lib/promise-queue.ts"&gt;PromiseQueue&lt;/a&gt; is used to manage this.&lt;/p&gt;
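&lt;p&gt;Conceptually the &lt;code&gt;PromiseQueue&lt;/code&gt; behaves like a condition variable built out of promises - roughly along these lines (a simplified sketch, not the library's actual implementation):&lt;/p&gt;

```typescript
// A promise-based condition variable: callers await wait(),
// and notify() wakes everyone currently waiting.
class PromiseQueue {
  private waiting: Array<() => void> = [];

  // Suspend the caller until the next notify()
  wait(): Promise<void> {
    return new Promise((resolve) => this.waiting.push(resolve));
  }

  // Wake every waiter - e.g. when a new chunk lands in the cache
  notify(): void {
    const woken = this.waiting;
    this.waiting = [];
    for (const resolve of woken) resolve();
  }
}
```

&lt;p&gt;A cursor that reaches the end of the cached data simply &lt;code&gt;await&lt;/code&gt;s &lt;code&gt;wait()&lt;/code&gt;, and the stream side calls &lt;code&gt;notify()&lt;/code&gt; whenever a new chunk arrives.&lt;/p&gt;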

&lt;p&gt;Reading a &lt;code&gt;1GB&lt;/code&gt; file in &lt;code&gt;100-byte&lt;/code&gt; chunks leads to at least &lt;code&gt;10,000,000 IOs&lt;/code&gt; through three async call layers. The problem becomes catastrophic since these functions are essentially language-level abstractions of callbacks, lacking optimizations that come with their async nature. However, we can manually implement optimizations to alleviate this issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing
&lt;/h2&gt;

&lt;p&gt;So let's go through the base implementation, then a few different variations and optimisations; or you can skip ahead to the results and work your way backwards if you prefer.&lt;/p&gt;

&lt;p&gt;A quick note about the testing methodology: Each test ran 10 times consecutively, starting from a cold state. The first result was consistently slower, while the other nine were nearly identical. This suggests either NodeJS temporarily saves optimized code between runs, or the NAS intelligently caches the file for quicker access. The latter is more likely, as longer durations between cold starts result in slower initial executions.&lt;/p&gt;

&lt;p&gt;The test file used is &lt;a href="https://dumps.wikimedia.org/simplewiki/latest/simplewiki-latest-pages-articles.xml.bz2"&gt;here&lt;/a&gt; (streamed as a standalone XML file).&lt;/p&gt;

&lt;h3&gt;
  
  
  Full Async
&lt;/h3&gt;

&lt;p&gt;So we have a cursor which we can call next on, which forwards the request to the &lt;a href="https://github.com/AjaniBilby/BNF-parser/blob/350e9a00fc4ca06acc98245377fb705f00d286b8/source/lib/cache.ts#L52-L271"&gt;StreamCache&lt;/a&gt; - which then handles all of the actual read behaviour.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Cursor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;highWaterMark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_owner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;highWaterMark&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="c1"&gt;// ...&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then have our main file, which just creates a &lt;a href="https://github.com/AjaniBilby/BNF-parser/blob/350e9a00fc4ca06acc98245377fb705f00d286b8/source/lib/cache.ts#L52-L271"&gt;StreamCache&lt;/a&gt;, adds a cursor, and pipes a &lt;code&gt;fs.createReadStream&lt;/code&gt; into it - in a kind of backwards way compared to the normal piping API, but this is due to the way &lt;code&gt;StreamCache&lt;/code&gt; has been implemented to allow for NodeJS and WebJS readable stream API differences.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The cursor is added before piping to ensure the first bytes of data aren't read into the cache and then dropped because no cursor can reach them&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamCache&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;cursorA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipe_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isDone&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;read&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;end&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeEnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Wrapper Optimisation
&lt;/h3&gt;

&lt;p&gt;In the cursor before, we had an async function basically just acting as a wrapper. If you understand the async abstraction, you'd know an async function just returns a promise - so there is no actual need to create this extra async function; instead we can just return the promise created by the child call. (This has a level of performance benefit it really shouldn't :D)&lt;/p&gt;

&lt;p&gt;To:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Cursor&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;highWaterMark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_owner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;highWaterMark&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
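&lt;p&gt;You can see the difference with a toy micro-benchmark (a sketch only - the actual numbers will vary wildly by machine and engine version):&lt;/p&gt;

```typescript
// The wrapped version allocates an extra async frame + promise per call;
// the forwarding version just hands back the child's promise.
async function child(): Promise<number> {
  return 1;
}

async function wrapped(): Promise<number> {
  return await child(); // extra async machinery per call
}

function forwarded(): Promise<number> {
  return child(); // just forward the child's promise
}

async function bench(fn: () => Promise<number>, n: number): Promise<number> {
  const start = Date.now();
  for (let i = 0; i < n; i++) await fn();
  return Date.now() - start;
}

bench(wrapped, 100_000)
  .then((ms) => {
    console.log("wrapped:", ms, "ms");
    return bench(forwarded, 100_000);
  })
  .then((ms) => console.log("forwarded:", ms, "ms"));
```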



&lt;h3&gt;
  
  
  Inlined
&lt;/h3&gt;

&lt;p&gt;In this case we pretended to be a compiler and inlined our own function - we literally just embedded the functionality of &lt;code&gt;StreamCache._read&lt;/code&gt; into where it was being called, which completely broke our public/private attribute protections 😶🔫&lt;br&gt;&lt;br&gt;
If only there was a compiler like &lt;em&gt;Typescript&lt;/em&gt; to do inlining safely for us 👀&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamCache&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;cursorA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipe_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isDone&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Cursor behind buffer position&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_total_cache&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_ended&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;loc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_offset_to_cacheLoc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;read&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;end&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeEnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Async With Peeking
&lt;/h3&gt;

&lt;p&gt;If all else fails, avoid async when possible. So in this case I added a few functions.&lt;br&gt;&lt;br&gt;
Peek will tell me if I can read without waiting, in which case &lt;code&gt;_skip_read&lt;/code&gt; is safe to call.&lt;br&gt;&lt;br&gt;
Otherwise we go back to calling the async method.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamCache&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;cursorA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipe_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isDone&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_skip_read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isDone&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;read&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;peaked&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;val&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;read&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;end&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeEnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this use case this actually saves a lot of time, because a large proportion of the calls never needed to wait at all - the loaded chunk sizes were so large that the requested data was almost always already in the cache.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Bytes Read&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Via Async&lt;/td&gt;
&lt;td&gt;919417&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Via Peeking&lt;/td&gt;
&lt;td&gt;1173681200&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;1174600617&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
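&lt;p&gt;The peek-then-await pattern above can be sketched in isolation. This is a hypothetical stand-in, not the real &lt;code&gt;StreamCache&lt;/code&gt; - the names (&lt;code&gt;PeekableSource&lt;/code&gt;, &lt;code&gt;skipRead&lt;/code&gt;, &lt;code&gt;drain&lt;/code&gt;) are made up for illustration:&lt;/p&gt;

```javascript
// Hypothetical sketch (not the article's StreamCache) of peek-then-await:
// a synchronous read returns whatever is already buffered, and we only pay
// for an await as the fallback when the buffer came back empty.
class PeekableSource {
  constructor() { this.buffer = "abcdef"; }

  // Synchronous peek-read: consume up to `size` chars already in the buffer
  skipRead(size) {
    const out = this.buffer.slice(0, size);
    this.buffer = this.buffer.slice(out.length);
    return out;
  }

  // Async fallback, used only when skipRead() returned nothing
  async next(size) {
    this.buffer += "more data"; // stand-in for waiting on the stream
    return this.skipRead(size);
  }
}

async function drain(src, iterations) {
  let read = 0, peeked = 0;
  for (let i = 0; i < iterations; i++) {
    const val = src.skipRead(4); // no await on the hot path
    peeked += val.length;
    if (val === "") read += (await src.next(4)).length;
  }
  return { read, peeked };
}
```

&lt;p&gt;Because the synchronous path never allocates a &lt;code&gt;Promise&lt;/code&gt;, the await overhead is only paid on the small fraction of calls that actually have to wait.&lt;/p&gt;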

&lt;h3&gt;
  
  
  Disk Read
&lt;/h3&gt;

&lt;p&gt;As with all good tests, we need a baseline - so in this case we don't even have an active cursor; we literally just let data flow in and out of the &lt;code&gt;StreamCache&lt;/code&gt; as fast as possible, giving us the limit of our disk read plus the &lt;code&gt;alloc&lt;/code&gt; and &lt;code&gt;free&lt;/code&gt; overhead as we add and remove cache pools.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;experimental&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;StreamCache&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;cursorA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pipe_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;end&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeEnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Callback
&lt;/h3&gt;

&lt;p&gt;Finally we need a test to make sure this isn't a de-optimisation bug: if we go back to the callback-hell days, how do we fare?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: I didn't rewrite the &lt;code&gt;signal.wait()&lt;/code&gt;, as trying to create an optimised callback system inside a for loop would be hell on earth to implement.&lt;br&gt;
And yes, we do need a while loop, because it might take more than one chunk loading in to fulfil the requested read - chunk sizes can sometimes be weird and inconsistent, plus maybe you just want a large chunk read at once 🤷&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;StreamCache&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Cursor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Cursor behind buffer position&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Wait for more data to load if necessary&lt;/span&gt;
    &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_total_cache&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// The required data will never be loaded&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_ended&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="c1"&gt;// Wait for more data&lt;/span&gt;
      &lt;span class="c1"&gt;//   Warn: state might change here (including cursor)&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;wait&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Return the data&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;loc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_offset_to_cacheLoc&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]].&lt;/span&gt;&lt;span class="nf"&gt;slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nx"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nx"&gt;cursor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_offset&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;out&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;ittr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;read&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isDone&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cursorA&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ittr&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;ittr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="nx"&gt;fstream&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;on&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;end&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;=&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeEnd&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Case&lt;/th&gt;
&lt;th&gt;Duration (Min)&lt;/th&gt;
&lt;th&gt;Median&lt;/th&gt;
&lt;th&gt;Mean&lt;/th&gt;
&lt;th&gt;Max&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;#Full Async&lt;/td&gt;
&lt;td&gt;27.742s&lt;/td&gt;
&lt;td&gt;28.339s&lt;/td&gt;
&lt;td&gt;28.946s&lt;/td&gt;
&lt;td&gt;35.203s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#Async Wrapper Opt&lt;/td&gt;
&lt;td&gt;14.758s&lt;/td&gt;
&lt;td&gt;14.977s&lt;/td&gt;
&lt;td&gt;15.761s&lt;/td&gt;
&lt;td&gt;22.847s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#Callback&lt;/td&gt;
&lt;td&gt;13.753s&lt;/td&gt;
&lt;td&gt;13.902s&lt;/td&gt;
&lt;td&gt;14.683s&lt;/td&gt;
&lt;td&gt;21.909s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#Inlined Async&lt;/td&gt;
&lt;td&gt;2.025s&lt;/td&gt;
&lt;td&gt;2.048s&lt;/td&gt;
&lt;td&gt;3.037s&lt;/td&gt;
&lt;td&gt;11.847s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#Async w/ Peeking&lt;/td&gt;
&lt;td&gt;1.970s&lt;/td&gt;
&lt;td&gt;2.085s&lt;/td&gt;
&lt;td&gt;3.054s&lt;/td&gt;
&lt;td&gt;11.890s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;#Disk Read&lt;/td&gt;
&lt;td&gt;1.970s&lt;/td&gt;
&lt;td&gt;1.996s&lt;/td&gt;
&lt;td&gt;2.982s&lt;/td&gt;
&lt;td&gt;11.850s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;It's kind of terrifying how effective changing just the wrapper function &lt;code&gt;Cursor.next&lt;/code&gt; is: it shows there are easy optimisation improvements available. That, plus the &lt;code&gt;13.9x&lt;/code&gt; performance improvement from inlining, shows there is room such that even if V8 doesn't get around to implementing these optimisations, tools like Typescript certainly could.&lt;/p&gt;
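&lt;p&gt;The kind of wrapper removal being measured can be sketched as follows. This is a hypothetical stand-in, not the article's implementation - the class bodies and names here (&lt;code&gt;Cache&lt;/code&gt;, &lt;code&gt;viaWrapper&lt;/code&gt;, &lt;code&gt;viaInline&lt;/code&gt;) are made up for illustration:&lt;/p&gt;

```javascript
// Hypothetical sketch: a thin async wrapper (Cursor.next delegating to the
// underlying read) adds a whole extra async frame per call, which a caller
// can "inline" away by calling the underlying function directly.
class Cache {
  constructor(data) { this.data = data; this.offset = 0; }
  async read(size) {
    const out = this.data.slice(this.offset, this.offset + size);
    this.offset += out.length;
    return out;
  }
}

class Cursor {
  constructor(cache) { this.cache = cache; }
  // The thin wrapper: one extra async function per read
  async next(size) { return await this.cache.read(size); }
}

async function viaWrapper(data) {
  const cursor = new Cursor(new Cache(data));
  let read = 0, chunk;
  while ((chunk = await cursor.next(2)).length > 0) read += chunk.length;
  return read;
}

async function viaInline(data) {
  const cache = new Cache(data);
  let read = 0, chunk;
  // Manually inlined: identical behaviour, one less async frame per iteration
  while ((chunk = await cache.read(2)).length > 0) read += chunk.length;
  return read;
}
```

&lt;p&gt;Both functions produce identical results; the only difference is how many async frames each iteration passes through, which is exactly the overhead the benchmark exposes.&lt;/p&gt;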

&lt;p&gt;Also, if you look at the peeking example, we hit quite an interesting limit. In that case only &lt;code&gt;0.078%&lt;/code&gt; of requests were fulfilled by the async function, meaning only about &lt;code&gt;9194&lt;/code&gt; of &lt;code&gt;11746006&lt;/code&gt; requests had to wait for data to load. This implies our CPU is being almost perfectly fed by the incoming data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The performance of asynchronous JavaScript functions can be significantly improved with simple tweaks to the code. The results of this case study demonstrate the potential for 1.9x to 14x speed boosts with manual optimisations. V8's current lack of optimisation for these features leaves room for further improvements in the future.&lt;/p&gt;

&lt;p&gt;When using the raw &lt;code&gt;Promise&lt;/code&gt; API directly, there is a strong argument that optimising this behaviour without potentially altering execution order is extraordinarily hard. But when we use the &lt;code&gt;async&lt;/code&gt;/&lt;code&gt;await&lt;/code&gt; syntax without ever touching a &lt;code&gt;Promise&lt;/code&gt; directly, our functions are written in such a way that some fairly easy, behaviour-preserving optimisations become possible.&lt;/p&gt;

&lt;p&gt;The fact that simply altering the wrapper call creates an almost 1.9x boost in performance should be horrifying for anyone who has used a compiled language. It's a simple function call redirection, and could easily be optimised out of existence in most cases.&lt;/p&gt;

&lt;p&gt;We don't need to wait for the browsers to implement these optimisations; tools such as Typescript already offer transpiling to older ES versions, clearly showing the compiler infrastructure has a deep understanding of the behaviour of the language. For a long time people have been saying that Typescript doesn't need to optimise your Javascript since V8 already does such a good job, but that clearly isn't the case with this newer async syntax - with a little bit of static analysis and inlining alone, Javascript can become far more performant.&lt;/p&gt;

&lt;h2&gt;
  
  
  Take Away
&lt;/h2&gt;

&lt;p&gt;Currently in V8's implementation of Javascript, &lt;code&gt;async&lt;/code&gt; is just an abstraction of &lt;code&gt;Promise&lt;/code&gt;s, and &lt;code&gt;Promise&lt;/code&gt;s are just an abstraction of callbacks, and V8 doesn't appear to use the added information that an &lt;code&gt;async&lt;/code&gt; function provides over a traditional callback to make any sort of optimisations.&lt;/p&gt;
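&lt;p&gt;That layering can be shown as three equivalent ways of writing the same operation - V8 must preserve identical observable behaviour across all three, even though the &lt;code&gt;async&lt;/code&gt; form carries far more static information than a bare callback:&lt;/p&gt;

```javascript
// 1. Callback style: the result is delivered to a continuation
function addCallback(a, b, done) { done(a + b); }

// 2. Promise style: the callback wrapped in a Promise
function addPromise(a, b) {
  return new Promise(resolve => addCallback(a, b, resolve));
}

// 3. async/await style: syntactic sugar over the Promise
async function addAsync(a, b) {
  return await addPromise(a, b);
}
```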

&lt;p&gt;While the majority of active async Javascript code is probably IO-bound rather than CPU-bound, so this likely won't affect most Javascript code, your code can still be limited by these performance characteristics even when you're not the one doing the heavy CPU work. How you interface with a given library can give you massively different performance characteristics depending on whether or not you're using asynchronous code, and the problem can be exacerbated by the implementation details of the library.&lt;/p&gt;

&lt;h3&gt;
  
  
  What you can do now
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;As a general rule, try to avoid async where possible - and no, callbacks are not the solution, because they have the same performance impact.&lt;/li&gt;
&lt;li&gt;Where possible, instead of creating a new Promise bound by another, merge them into a single Promise.&lt;/li&gt;
&lt;/ol&gt;
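&lt;p&gt;One way to read rule 2: rather than allocating a new &lt;code&gt;Promise&lt;/code&gt; bound by an existing one, chain or return the original so only a single &lt;code&gt;Promise&lt;/code&gt; exists. A minimal sketch:&lt;/p&gt;

```javascript
// Wasteful: a second Promise wrapping, and bounded by, the first
function doubleWrapped(p) {
  return new Promise(resolve => { p.then(v => resolve(v * 2)); });
}

// Merged: .then() already returns a Promise, so reuse it
function merged(p) {
  return p.then(v => v * 2);
}
```

&lt;p&gt;Both resolve to the same value, but the merged form allocates one fewer &lt;code&gt;Promise&lt;/code&gt; and passes through one fewer microtask hop.&lt;/p&gt;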

</description>
      <category>javascript</category>
      <category>async</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
