<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ScaleDynamics</title>
    <description>The latest articles on DEV Community by ScaleDynamics (@hoshiwarpsjs).</description>
    <link>https://dev.to/hoshiwarpsjs</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F128998%2Fe0fbd947-5a9f-4704-9043-2d595dbe9902.png</url>
      <title>DEV Community: ScaleDynamics</title>
      <link>https://dev.to/hoshiwarpsjs</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hoshiwarpsjs"/>
    <language>en</language>
    <item>
      <title>Why Your Browser Benchmark is Lying to You About AI Performance</title>
      <dc:creator>ScaleDynamics</dc:creator>
      <pubDate>Tue, 03 Feb 2026 07:58:00 +0000</pubDate>
      <link>https://dev.to/hoshiwarpsjs/why-your-browser-benchmark-is-lying-to-you-about-ai-performance-5gn2</link>
      <guid>https://dev.to/hoshiwarpsjs/why-your-browser-benchmark-is-lying-to-you-about-ai-performance-5gn2</guid>
      <description>&lt;p&gt;For years, we’ve measured web performance through the lens of latency. How fast does this script load? How quickly can the engine execute this single loop? However, the "Document Web" is no longer active. We are now living in the "Compute Web" era—where browsers are expected to run local AI inference, process massive data streams, and handle complex UI states simultaneously.&lt;/p&gt;

&lt;p&gt;Traditional benchmarks are like testing a 16-cylinder engine by checking the speed of a single piston. They don't tell you how the engine performs under actual load.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Single-Task Fallacy: An Outdated View
&lt;/h2&gt;

&lt;p&gt;Most benchmarks (like JetStream or Speedometer) focus on sequential execution. While they are great for measuring JS engine maturity and browser performance, they fail to account for Task Saturation.&lt;/p&gt;

&lt;p&gt;Peak performance in a modern AI WebApp is all about how efficiently the browser can orchestrate concurrent, resource-intensive tasks, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU Intensive Work:&lt;/strong&gt; Pre-processing large datasets or 50MB JSON payloads in a Web Worker.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU Intensive Work:&lt;/strong&gt; Running a local AI inference model using WebGPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Main Thread Work:&lt;/strong&gt; Keeping the UI responsive and rendering smooth at 60fps.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Evaluating modern web apps therefore requires a benchmark that applies this simultaneous load, because the true bottleneck is often not the raw speed of one component, but the efficiency of the "handoff" and scheduling between all of them. Focusing on a single isolated variable, such as raw GPU speed, fails to capture the "Ultimate Performance" of the application under real-world conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Deep-Dive
&lt;/h2&gt;

&lt;p&gt;At ScaleDynamics, we’ve observed that the true bottleneck in AI-driven web apps is often the "handoff" between the CPU and GPU.&lt;/p&gt;

&lt;p&gt;We built SpeedPower.run out of frustration. Existing browser benchmarks are too synthetic and disconnected from the challenges of the modern web, where real applications perform heavy pre/post-processing, run multiple AI models, and handle critical rendering simultaneously. Our mission is simple: to create the definitive benchmark for real-world compute performance on the modern web.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Simultaneous Load Methodology&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SpeedPower.run determines your browser and device's maximum performance by pushing all CPUs and GPUs to their limit simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79o8fvbwrj94nx84ua70.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79o8fvbwrj94nx84ua70.png" alt=" " width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unlike other tools that test one thing at a time, we run multiple concurrent tasks, such as running AI inferences while also doing heavy JavaScript processing. We use all available web technologies: JavaScript, WASM, WebGL, and WebGPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How We Ensure a Fair Score (Methodology &amp;amp; Integrity):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The SpeedPower.run benchmark ensures a fair and accurate score by focusing purely on your device's computational power. First, it guarantees Zero Network Interference: the test timer only starts after all large assets, including the ~400MB of AI models, are loaded into memory. Second, it runs a Warm-up Execution phase before recording the final score, allowing the browser to finish its internal optimizations (such as code compilation) so the result reflects your device's peak performance, not its initial slow start.&lt;/p&gt;
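&lt;p&gt;The warm-up idea can be sketched in a few lines of JavaScript (our own illustrative code, not SpeedPower.run's actual implementation): run the workload a few times so the engine finishes its JIT compilation, then keep the best measured run.&lt;/p&gt;

```javascript
// Sketch of a warm-up-then-measure benchmark loop (illustrative only).
function benchmark(fn, warmupRuns, measuredRuns) {
  // Warm-up phase: results are discarded while the engine optimizes hot code.
  for (let i = warmupRuns; i > 0; i--) fn();

  // Measurement phase: keep the best (peak) run.
  let best = Infinity;
  for (let i = measuredRuns; i > 0; i--) {
    const start = performance.now();
    fn();
    const elapsed = performance.now() - start;
    if (best > elapsed) best = elapsed;
  }
  return best; // peak time in milliseconds
}

// Example workload: summing a million numbers.
const peakMs = benchmark(() => {
  let s = 0;
  for (let i = 1e6; i > 0; i--) s += i;
  return s;
}, 5, 10);
console.log("peak run:", peakMs.toFixed(3), "ms");
```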

&lt;p&gt;To provide a reliable measurement, the methodology prioritizes Score Stability by using statistical regression analysis on peak metrics to smooth out system-level scheduling noise. This process generates a dependable result that is not based on a single moment in time. For users, the process is simple: since factors outside the benchmark's control (like the operating system) can affect performance, it is recommended to run the test multiple times to confidently capture the highest possible score your device can achieve.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Benchmarks
&lt;/h2&gt;

&lt;p&gt;SpeedPower.run consists of the following core benchmarks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;JavaScript:&lt;/strong&gt; This benchmark measures raw computational power for pre/post-processing on JS objects and JSON. It utilizes four tests from the Apple/WebKit JetStream 2 suite: Access Binary Trees, Control Flow Recursive, Regexp DNA, and String Tag Cloud. We run these benchmarks in parallel across multiple Web Workers to measure the maximum multi-core CPU processing power.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;AI with TensorFlow.js:&lt;/strong&gt; We utilize TensorFlow.js to test the maturity and performance of established web AI pipelines.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI Recognition TFJS: Measures the steady-state inference throughput of the BlazeFace model (via TensorFlow.js). Using a 128x128 input tensor and a pre-warmed graph, this test isolates the raw performance of the backend (JavaScript, WASM, WebGL, or WebGPU). It specifically measures the speed of the forward pass and the subsequent interpretive post-processing (decoding the highest-confidence face detection).&lt;/li&gt;
&lt;li&gt;AI Classify TFJS: This benchmark measures the throughput of the MobileNetV3 Small architecture. Using a fixed 224x224 input tensor and a pre-warmed graph, this test isolates the raw performance of the backend (JavaScript, WASM, WebGL, or WebGPU). It specifically measures the speed of the forward pass and the subsequent interpretive post-processing (decoding the highest-confidence score).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;AI with Transformers.js:&lt;/strong&gt; SpeedPower.run pushes the boundaries of next-gen in-browser AI by leveraging Transformers.js v3 for our most advanced workloads.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI Classify Transformers: Measures the throughput of the MobileNetV4-Small architecture (via Transformers.js v3). It prioritizes a high-performance WebGPU backend (falling back to WebGL) with a fixed 224x224 input tensor. This score reflects the system's capacity for parallel inference, leveraging asynchronous command queues and compute shaders to process workloads with high concurrency.&lt;/li&gt;
&lt;li&gt;AI LLM Transformers: Measures the throughput of the SmolLM2-135M-Instruct causal language model (via Transformers.js v3). Using a 4-bit quantized (q4) ONNX model, this benchmark isolates the GPU runtime efficiency from model loading overhead. It captures the hardware's ability to orchestrate multi-threaded LLM execution and real-time autoregressive decoding.&lt;/li&gt;
&lt;li&gt;AI Speech Transformers: Measures the throughput of the Moonshine-Tiny automatic speech recognition (ASR) architecture. It uses a hybrid-precision model (FP32 encoder + q4 decoder) to isolate GPU runtime efficiency from audio processing overhead. The score highlights the capacity for complex, high-concurrency speech-to-text pipelines.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exchange:&lt;/strong&gt; Since modern apps rely on Web Workers, the "Exchange" benchmark measures the communication bottleneck between the main thread and workers. It tests the transfer speed of IPC, Transferables, Arrays, Buffers, Objects, and OffScreen Canvas. The higher the score, the more efficiently your main thread communicates with background workers.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;
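&lt;p&gt;The copy-versus-transfer distinction that the "Exchange" benchmark probes can be demonstrated with &lt;code&gt;structuredClone&lt;/code&gt;, the same algorithm &lt;code&gt;postMessage&lt;/code&gt; uses between threads. This standalone sketch runs in modern browsers and in Node.js 17+:&lt;/p&gt;

```javascript
// Copy vs. transfer: the difference the "Exchange"-style tests measure.
const big = new ArrayBuffer(16 * 1024 * 1024); // 16 MB payload

// Copying: the clone duplicates all 16 MB; the original stays usable.
const copy = structuredClone(big);
console.log(copy.byteLength, big.byteLength); // both 16777216

// Transferring: ownership moves, no copy is made, the source is detached.
const moved = structuredClone(big, { transfer: [big] });
console.log(moved.byteLength, big.byteLength); // 16777216 and 0
```

&lt;p&gt;The same &lt;code&gt;transfer&lt;/code&gt; option exists on &lt;code&gt;postMessage&lt;/code&gt;, which is why transferring large buffers to a worker is so much cheaper than cloning them.&lt;/p&gt;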

&lt;h2&gt;
  
  
  Architecture: No Installation Required
&lt;/h2&gt;

&lt;p&gt;We were adamant that this should require zero installation or setup. By leveraging WebAssembly (WASM) and WebGPU, we can access the bare metal of your device directly through the browser.&lt;/p&gt;

&lt;p&gt;You don't need to download a 5GB suite to see if your rig is ready for the AI web. You just click, and in 30 seconds, we saturate every available thread to find your browser's breaking point for modern, complex applications.&lt;/p&gt;

&lt;h2&gt;
  
  
  Help Us Calibrate the Benchmark
&lt;/h2&gt;

&lt;p&gt;We are currently collecting data across thousands of hardware/browser combinations to refine our scoring for the "Ultimate Performance" of the modern web.&lt;/p&gt;

&lt;p&gt;We’ve seen some fascinating anomalies already, like high-end mobile ARM chips showing better task-switching efficiency than some mid-range x86 desktops due to better thermal-aware scheduling in the browser.&lt;br&gt;
Run the test on your dev rig: &lt;a href="https://speedpower.run/?ref=devto-1" rel="noopener noreferrer"&gt;https://speedpower.run&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Does the result match your "real-world" multitasking experience? Drop your score and your hardware specs in the comments. Let’s talk about the future of the compute-heavy web.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to speed up Node.js matrix computing with Math.js 🌠</title>
      <dc:creator>ScaleDynamics</dc:creator>
      <pubDate>Thu, 07 Feb 2019 14:09:13 +0000</pubDate>
      <link>https://dev.to/hoshiwarpsjs/how-to-speed-up-nodejs-matrix-computing-with-mathjs--3o68</link>
      <guid>https://dev.to/hoshiwarpsjs/how-to-speed-up-nodejs-matrix-computing-with-mathjs--3o68</guid>
      <description>&lt;p&gt;&lt;em&gt;This article was originally published on &lt;a href="https://medium.com/starnodejs/speeding-up-matrix-computation-with-node-js-8e1201a164b2" rel="noopener noreferrer"&gt;Medium&lt;/a&gt; by &lt;a href="https://twitter.com/domis66" rel="noopener noreferrer"&gt;Dominique Péré&lt;/a&gt;, a member of &lt;a href="//www.warpjs.dev"&gt;WarpJS&lt;/a&gt;.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;This is part 1 of a series of articles on micro-benchmarks for matrix computations. This first article focuses on a math.js benchmark, and part 2 will discuss a TensorFlow benchmark. Make sure to subscribe if you don’t want to miss it!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this article, you will learn how performing parallel computations can speed up the multiplication of two matrices.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I recently had occasion to revisit some of the math I learned in high school. Finally, I can see the use of all those matrix multiplication exercises! My background is in IT engineering, but I have to admit that AI involves much more math than IT does.&lt;/p&gt;

&lt;p&gt;I am now working for the company that is developing Starnode, a JavaScript library designed to speed up node.js. The only problem with &lt;strong&gt;JavaScript is that it is only able to carry out computations using a single thread, a single process and the CPU&lt;/strong&gt; (it’s like a restaurant with only one chef in the kitchen!). Why is JavaScript designed like this? The purpose is to keep it simple and non-blocking. You can find out a lot more about this aspect of JavaScript in this article.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why matrix computing takes forever
&lt;/h1&gt;

&lt;p&gt;Matrix multiplication is a recurring operation performed in many domains, such as signal processing, data analysis and, more recently, AI.&lt;/p&gt;

&lt;p&gt;In these use cases, the matrices implemented are rather large, frequently containing more than a thousand rows. Let’s assume we are multiplying two matrices, each one with dimensions 1000 × 1000. The number of operations that would need to be performed would be:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;2 × 1,000³ − 1,000² = 1,999,000,000&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s right — nearly 2 billion operations! It’s no surprise the CPU is so busy when performing such computations. With so much on its plate, it can’t do anything else! So let’s see what we can do to free up the main CPU thread and event loop and speed up the process.&lt;/p&gt;
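&lt;p&gt;As a quick sanity check on that figure (the function name is ours, purely for illustration):&lt;/p&gt;

```javascript
// Worked check of the operation count quoted above: each of the n x n
// output entries takes n multiplications and (n - 1) additions, so the
// total is n^2 * (2n - 1) = 2n^3 - n^2.
function matMulOpCount(n) {
  return n * n * (2 * n - 1);
}

console.log(matMulOpCount(1000)); // 1999000000, i.e. nearly 2 billion
```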

&lt;h1&gt;
  
  
  The key to speeding up matrix computation: parallelization
&lt;/h1&gt;

&lt;p&gt;Here is the challenge: speed up the multiplication of two large matrices in single-threaded node. We could have used the child_process library to fork another process and assign parts of the job to it (or done the same with worker threads), but we wanted to keep our code simple and come up with a solution that works with a variable number of CPUs/threads. Fortunately, we have some of the most skilled virtual machine PhDs and engineers working with us to help optimize the parallelization, and we created Starnode, a very simple API that can be used to parallelize any standard JavaScript function. With the ability to perform fine-grained parallelization, we set out to determine how much time could be saved on large matrix computations.&lt;/p&gt;

&lt;p&gt;My hardware engineer colleague (who happens to be a former math professor!) and I focused on possible ways to parallelize a sequential algorithm, as this would allow us to split operations for large matrices between multiple processing resources using the JavaScript-based ScaleDynamics “warp,” a dynamic compiler technology (more about this in another story).&lt;/p&gt;

&lt;h1&gt;
  
  
  Splitting and computing in parallel
&lt;/h1&gt;

&lt;p&gt;To parallelize matrix multiplication efficiently, be it with Starnode technology or using any other parallelization technique, one must start by identifying independent blocks of operations that can take place concurrently, with minimal overhead time for the execution of splits and recombinations and minimum data transfer.&lt;/p&gt;

&lt;p&gt;We tried two different approaches: splitting the matrices band-wise in the first, and tile-wise in the second. Band-wise splitting worked well for small matrices, but when we tried larger ones (400 rows or more), we found that tile-wise splitting was the best way to go.&lt;/p&gt;

&lt;p&gt;Below, one can see how these two input-matrix splitting schemes are implemented for the product R = A × B:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the case of a band-wise split, A is split into blocks of consecutive rows. Each block Ai is then multiplied by the full matrix B, yielding the result Ri, which constitutes a block of consecutive rows in the product matrix R.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F880%2F1%2AqPouZE6nfaoyx8rd2d7kZQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F880%2F1%2AqPouZE6nfaoyx8rd2d7kZQ.jpeg" title="Figure 1a: band-wise split" alt="alt text" width="640" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1a: band-wise split&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In a tile-wise split, A is split into blocks of consecutive rows and B into blocks of consecutive columns. Each block Ai is then multiplied by the block Bi, yielding Ri, which constitutes a “tile” in the product matrix R.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F880%2F1%2AL51Y_HFGfVchuS5Pwd44Hg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F880%2F1%2AL51Y_HFGfVchuS5Pwd44Hg.jpeg" title="Figure 1b: tile-wise split" alt="alt text" width="638" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1b: tile-wise split&lt;/em&gt;&lt;br&gt;
Matrix shapes have little impact for a given number of elements, as long as the form factor of the matrix is not excessively rectangular. With small matrices, band-wise splits entail slightly less parallelization overhead than tile-wise splits thanks to the faster B-matrix readings and very straightforward process for merging blocks in the product matrix. This advantage vanishes rapidly, however, as the size of the B matrix increases due to the cache hierarchy conflicts that result from all processes using full B array data.&lt;/p&gt;
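&lt;p&gt;The splitting schemes above can be sketched in plain JavaScript. This is our own illustrative code, not Starnode; it shows why each tile is an independent, parallelizable unit of work:&lt;/p&gt;

```javascript
// Illustrative tile-wise split in plain JavaScript. The function names
// (multiply, chunkRows, chunkCols, tileMultiply) are our own examples.
function multiply(A, B) {
  // Naive product: R[i][j] is the dot product of row i of A and column j of B.
  return A.map((row) =>
    B[0].map((_, j) => row.reduce((sum, a, k) => sum + a * B[k][j], 0))
  );
}

function chunkRows(M, size) {
  // Split a matrix into consecutive row bands of at most `size` rows.
  const out = [];
  let rest = M;
  while (rest.length > 0) {
    out.push(rest.slice(0, size));
    rest = rest.slice(size);
  }
  return out;
}

function chunkCols(M, size) {
  // Split a matrix into consecutive column blocks of at most `size` columns.
  const out = [];
  let offset = 0;
  while (M[0].length > offset) {
    out.push(M.map((row) => row.slice(offset, offset + size)));
    offset += size;
  }
  return out;
}

function tileMultiply(A, B, size) {
  const bands = chunkRows(A, size);  // row bands of A
  const blocks = chunkCols(B, size); // column blocks of B
  // Each multiply(band, block) call is independent of the others:
  // this is the parallelizable unit.
  const tileRows = bands.map((band) =>
    blocks.map((block) => multiply(band, block))
  );
  // Recombine: concatenate tiles horizontally within a band, then stack bands.
  return tileRows.flatMap((tiles) =>
    tiles[0].map((_, r) => tiles.flatMap((t) => t[r]))
  );
}

console.log(tileMultiply([[1, 2], [3, 4]], [[5, 6], [7, 8]], 1));
// same result as the naive product
```

&lt;p&gt;In a parallel setting, each &lt;code&gt;multiply(band, block)&lt;/code&gt; call would be dispatched to its own worker or process, and only the cheap recombination would run on the coordinating thread.&lt;/p&gt;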

&lt;h1&gt;
  
  
  The CPUs are burning!
&lt;/h1&gt;

&lt;p&gt;As our approach effectively uses all the resources of your computer, you can expect the fans to run faster, the temperature to increase and your matrices to be computed in a snap!&lt;/p&gt;

&lt;p&gt;We have run all our tests on a dedicated server with an Intel i7-7700 CPU (4 cores/8 threads, 4.2 GHz) and 32GB RAM.&lt;/p&gt;

&lt;p&gt;The following chart shows the time required to multiply math.js matrices of various sizes in node.js without Starnode and with Starnode, as well as the speedup factor when using Starnode in each case. As you can see, the larger the matrix is, the larger the speedup!&lt;/p&gt;

&lt;p&gt;This chart shows only the results of using the tile-wise parallelization method, as this method provided the best performance with node.js for matrices larger than 400 × 400.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F880%2F1%2AT68YyF1nG1YwImpMsNkXng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F880%2F1%2AT68YyF1nG1YwImpMsNkXng.png" title="As you can see, node.js with Starnode completed matrix multiplication up to six times faster than regular node.js!" alt="alt text" width="623" height="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;As you can see, node.js with Starnode completed matrix multiplication up to six times faster than regular node.js!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You can find below the detailed results for the two split methods. In this table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;m is the number of rows in the A matrix&lt;/li&gt;
&lt;li&gt;p is the number of rows in the B matrix (as well as the number of columns in A)&lt;/li&gt;
&lt;li&gt;n is the number of columns in the B matrix&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F880%2F1%2Aj4yQXXhI6ZVnfM8g0Y517g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F880%2F1%2Aj4yQXXhI6ZVnfM8g0Y517g.png" title="Table with speed-up factor" alt="alt text" width="525" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are very excited by these results, as we initially expected a speedup factor of only 2 or 3 at this scale of parallelization. Surprisingly, when implementing Starnode parallelization, very little overhead is required to make two processes “talk to each other,” resulting in much-improved computation speeds. For example, for the multiplication of a 2000 × 1200 matrix, we achieved a speedup factor of 6.1! ⚡&lt;/p&gt;

&lt;p&gt;The team is also currently working on a TensorFlow benchmark with the same operating mode, which I will link to here soon. Make sure to subscribe to learn new math skills to impress your colleagues! 🤓&lt;/p&gt;




&lt;p&gt;Thank you for reading! If you liked this article (or if you didn’t), feel free to leave a comment. We'll do our best to reply and update this article accordingly.&lt;/p&gt;

</description>
      <category>node</category>
      <category>javascript</category>
      <category>startup</category>
      <category>npm</category>
    </item>
  </channel>
</rss>
