Rust is increasingly becoming the language of choice for building web tooling. The latest example to date: Rome announced it will be using Rust.
Historically, Rust has also been one of the languages of choice for targeting WebAssembly, which has now shipped in all major browsers. One of the main benefits of using WebAssembly is that, in most cases, it is more performant than plain JavaScript. Hence the idea to try to optimise the latest library I released (https://github.com/AntonioVdlC/range) by rewriting it in Rust!
But first things first. A very smart person once said that you can only improve what you can measure. So before going any further, let's look at how we can measure the performance of the @antoniovdlc/range library.
There are a few good options for running benchmarks in Node (for example, the aptly named benchmark library, or tiny-benchy, which is used by Parcel), but for the sake of this exercise, let's look at a lower-level API and use Node's perf_hooks directly:
```js
#!/usr/bin/env node
const { performance, PerformanceObserver } = require("perf_hooks");

const range = require("../dist/index.cjs");

const testBenchmark = performance.timerify(function testBenchmark() {
  let sum = 0;
  let i = 0;

  const r = range(0, process.env.SIZE);
  while (!r.next().done) {
    sum += i;
    i++;
  }

  return sum;
});

const obs = new PerformanceObserver((list) => {
  const entries = list.getEntries();
  // Durations reported by perf_hooks are in milliseconds.
  const avgDuration =
    entries.reduce((sum, cur) => sum + cur.duration, 0) / entries.length;

  console.log(`range(0, ${process.env.SIZE}): ${avgDuration}ms`);
  obs.disconnect();
});
obs.observe({ entryTypes: ["function"] });

for (let i = 0; i < 1000; i++) {
  testBenchmark();
}
```
The code above runs, 1,000 times, a function that loops over a range of a given size and performs a simple sum in each iteration. The benchmark figure is the average duration across those 1,000 runs.
With this in hand, let's first look at the current implementation's performance:

```
range(0, 100): 0.007962769627571106ms
range(0, 1000): 0.015898147106170653ms
range(0, 10000): 0.08853049981594086ms
range(0, 100000): 0.8147728093862534ms
range(0, 1000000): 7.5012646638154985ms
```
Honestly, not too shabby! Can we do better with Rust and WebAssembly?
If you want to follow along, please make sure to install Rust and its toolchain.
To compile our Rust code to WebAssembly, we will be using wasm-pack. It can be installed either with Cargo (`cargo install wasm-pack`), or directly via npm:

```sh
npm i -D wasm-pack
```
We can then add the following script to our `package.json`:

```json
{
  ...
  "scripts": {
    ...
    "build:wasm": "wasm-pack build --target nodejs"
  }
}
```
Now let's write some Rust code!
The first thing we will do is declare a struct called `Range`, which closely mirrors our implementation of ranges in JavaScript.
```rust
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub struct Range {
    _start: i32,
    _stop: i32,
    _step: i32,
    _inclusive: bool,
    // Counter used for iteration, so that we can iterate multiple times over
    // the same range
    i: i32,
}

#[wasm_bindgen]
impl Range {
    #[wasm_bindgen(constructor)]
    pub fn new(start: i32, stop: i32, step: i32, inclusive: bool) -> Range {
        Range {
            _start: start,
            _stop: stop,
            // A step of 0 would never terminate, so default to 1
            _step: if step != 0 { step } else { 1 },
            _inclusive: inclusive,
            i: start,
        }
    }
}
```
To surface an API similar to the one we first implemented in JavaScript, we also write the following `range` function:
```rust
use js_sys::Error;

#[wasm_bindgen]
pub fn range(start: i32, stop: i32, step: i32, inclusive: bool) -> Result<Range, JsValue> {
    if start > stop {
        return Err(Error::new(
            (format!("Cannot create a range from {} to {}", start, stop)).as_str(),
        )
        .into());
    }

    Ok(Range::new(start, stop, step, inclusive))
}
```
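Stripped of the `wasm_bindgen` and `js_sys` types, the validation logic above can be exercised in plain Rust. This is just a sketch (with a `String` error swapped in for `JsValue` so it runs standalone), not the actual compiled module:

```rust
// Plain-Rust sketch of the validation in `range` above. The start/stop
// pair is returned directly instead of a Range, to keep the example minimal.
fn range_checked(start: i32, stop: i32) -> Result<(i32, i32), String> {
    if start > stop {
        // Mirrors the error message built with js_sys::Error in the wasm version
        return Err(format!("Cannot create a range from {} to {}", start, stop));
    }
    Ok((start, stop))
}

fn main() {
    assert!(range_checked(0, 10).is_ok());
    assert!(range_checked(10, 0).is_err());
    println!("ok");
}
```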
We could go on and implement the getters and other methods, but before investing too much in this exercise, let's focus on implementing the .next() method so that we can run our benchmarks on the compiled WebAssembly code.
```rust
#[wasm_bindgen]
pub struct JsIteratorResult {
    pub value: Option<i32>,
    pub done: bool,
}

#[wasm_bindgen]
impl Range {
    #[wasm_bindgen]
    pub fn next(&mut self) -> JsIteratorResult {
        if (self._inclusive && self.i <= self._stop) || self.i < self._stop {
            let value = self.i;
            self.i = self.i + self._step;

            return JsIteratorResult {
                value: Some(value),
                done: false,
            };
        }

        // Reset the counter so the same range can be iterated over again
        self.i = self._start;
        JsIteratorResult {
            value: None,
            done: true,
        }
    }
}
```
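Setting the `wasm_bindgen` plumbing aside, the iteration and counter-reset logic can be tested in plain Rust. The sketch below uses `Option<i32>` in place of `JsIteratorResult` so it runs standalone:

```rust
// Plain-Rust sketch of the iteration logic above, so the counter-reset
// behaviour (iterating the same range twice) can be checked directly.
struct Range {
    _start: i32,
    _stop: i32,
    _step: i32,
    _inclusive: bool,
    i: i32,
}

impl Range {
    fn next(&mut self) -> Option<i32> {
        if (self._inclusive && self.i <= self._stop) || self.i < self._stop {
            let value = self.i;
            self.i += self._step;
            return Some(value);
        }
        // Reset the counter so the same range can be iterated again
        self.i = self._start;
        None
    }
}

fn main() {
    let mut r = Range { _start: 0, _stop: 3, _step: 1, _inclusive: false, i: 0 };
    let first: Vec<i32> = std::iter::from_fn(|| r.next()).collect();
    assert_eq!(first, vec![0, 1, 2]);
    // After exhaustion the counter resets, so a second pass yields the same values
    let second: Vec<i32> = std::iter::from_fn(|| r.next()).collect();
    assert_eq!(second, vec![0, 1, 2]);
    println!("ok");
}
```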
The above implementation is again extremely similar to the JavaScript code.
After compiling the above Rust code into WebAssembly, let's look at the benchmark...

```
range(0, 100): 0.018000024318695067ms
range(0, 1000): 0.09116293668746948ms
range(0, 10000): 2.4152168154716493ms
...
```

... and unfortunately, the numbers were more than disappointing.
It seems like the WebAssembly version of that specific library is orders of magnitude slower. This is probably mostly due to my inexperience with Rust and WebAssembly in general, and there are definitely ways to look deeper into what is causing such a lackluster performance, but it is also OK to fail, stop, and look for the next challenge!
This was an interesting experiment, and even though the end result wasn't as expected, it was still a great learning opportunity!
If you want to browse and tinker with the full Rust code base, you can check it out at: https://github.com/AntonioVdlC/range/tree/wasm.
Maybe there are some obvious mistakes you can point me to!
Top comments (10)
This is most likely a bad choice of library to try and optimize with wasm. Basically, wasm has an overhead cost for every call. If wasm is called frequently and does very little work, then the overhead of calling it will outweigh any speed increase. A range is called repeatedly in a hot loop and does very little work, so it would most likely be slower in wasm.
As a side note, I believe modern JS runtimes are really good at optimizing things like hot for loops. Wasm disrupts this because the runtime has no idea what wasm is doing and isn't able to optimize the hot loop.
Thanks for your insights!
I agree that, in retrospect, this probably wasn't the most suitable choice. I wasn't too aware of the overhead of the interface, so that's something I've learnt. The lack of real computation in the code itself (simply incrementing a counter) definitely surfaced that overhead more prominently.
Try writing something much more compute-expensive, like a particle simulator, raytracer, or fractal renderer.
The reason this is important is, as already mentioned, that the interface involves a lot of overhead, so calling it so frequently for something already so fast in JS is wasteful.
Consider this alongside the fact that V8 has many optimisations and that the standard library is mostly backed by native code: it's hardly surprising that JS competes.
Finally (also as noted by others), compiling in release mode is important here. This takes slightly longer but makes the comparison much fairer, because your WASM actually has a chance of being optimised to the degree that the JS, and the native code behind it, already is.
Thanks for the comment!
I think you've raised a lot of very insightful points. I definitely agree that the library I chose probably wasn't the best, and I wasn't too aware of the interface overhead (so I guess I learnt something here!). At the end of the day, the only computation happening was incrementing a counter, so as you said, modern JavaScript engines already do a fairly good job at optimising that!
First of all, thank you for documenting an experiment that failed! I learned a lot from this article and its comments. Successful experiments often bring fewer lessons learned.
Second, I applaud you for measuring performance before and after. Most articles I read about performance tuning ignore this.
Thanks! I've definitely learned a lot from the failed experiment and the comments, and I'm glad you found it useful too!
I completely agree with @asafigan, but I'm wondering if you have compiled the Rust/WASM with the release flag and optimizations turned on.
Good point!
I did try to compile the Rust code adding the following lines to Cargo.toml:
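(A typical release profile of this kind looks like the sketch below; the exact lines used in the experiment may have differed.)

```toml
[profile.release]
# Maximum optimizations: slower to compile, faster at runtime
opt-level = 3
# Link-time optimization across crates
lto = true
```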
Unfortunately that didn't improve the performance much.
It is very possible that this setting alone didn't do what I thought it would.
If you drive the test loop from JS, it's bound to be slow: the time is wasted at the interface.
Thanks for the insight!
I think this was probably not the best candidate library to try and optimize. As you pointed out, the computation is minimal, and all the performance is lost at the interface.