There are a lot of ways to represent matrix math in JS. Some are readable and some are fast. I wanted to explore the differences a bit: how much are certain techniques actually saving me?
For this I'll just be looking at a single operation, element-wise addition, to reduce the total number of cases. Different operations could change the overall picture, especially ones like matrix multiplication which require slightly more complex application rules. The stats are also from my computer, a slightly older i7-8700K, using Deno, which is V8 under the hood. A different runtime like Bun might behave very differently if it has different optimizations.
The simple functional way
I thought I'd start here as this is how I might write it for a first draft. It's optimized for readability, but I'd suspect the performance is actually pretty bad. Immutability is great for avoiding bugs but terrible for perf, especially when JS doesn't have intelligent copies.
//mat.js
export function addMatrixFunc(a, b) {
  return a.map((row, ri) => row.map((val, ci) => b[ri][ci] + val));
}
The matrix representation is an array of arrays. The outer array is rows, the inner arrays are columns.
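So a 2x3 matrix looks like:

// 2 rows, 3 columns
const m = [
  [1, 2, 3],
  [4, 5, 6]
];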
Using Deno's built-in benchmarking tool, we can see how this performs on different size matrices.
import { addMatrixFunc } from "./mat.js";
import { mat100A, mat100B } from "./data/mat-data.js";
Deno.bench("Add 1x1", () => {
addMatrixFunc([[44]], [[65]]);
});
Deno.bench("Add 2x2", () => {
addMatrixFunc([[44]], [[65]]);
});
Deno.bench("Add 4x4", () => {
addMatrixFunc([[44]], [[65]]);
});
/* ... */
Deno.bench("Add 100x100", () => {
addMatrixFunc(mat100A, mat100B);
});
The mat100A and mat100B are pre-generated 100x100 matrices which are way too big to put in the test file.
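They could be generated with something like this (a sketch; the post's actual generator isn't shown, and genMatrix here is hypothetical):

// data/mat-data.js (hypothetical generator)
function genMatrix(rows, cols) {
  const out = new Array(rows);
  for (let r = 0; r < rows; r++) {
    out[r] = new Array(cols);
    for (let c = 0; c < cols; c++) {
      out[r][c] = Math.floor(Math.random() * 100);
    }
  }
  return out;
}

export const mat100A = genMatrix(100, 100);
export const mat100B = genMatrix(100, 100);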
One note here: I don't think Deno lets you set the iteration or warmup counts, at least anymore; it just looks for the numbers to converge. The actual number of runs is displayed in the JSON output, and it's slightly different per test.
Here's how we do:
Name | min | max | avg | p75 | p99 | p995 |
---|---|---|---|---|---|---|
Add 1x1 (Func) | 63ns | 180ns | 70ns | 74ns | 113ns | 124ns |
Add 2x2 (Func) | 144ns | 208ns | 152ns | 158ns | 184ns | 196ns |
Add 4x4 (Func) | 312ns | 373ns | 329ns | 335ns | 370ns | 373ns |
Add 8x8 (Func) | 694ns | 930ns | 724ns | 731ns | 930ns | 930ns |
Add 16x16 (Func) | 1798ns | 1942ns | 1836ns | 1843ns | 1942ns | 1942ns |
Add 32x32 (Func) | 5274ns | 6599ns | 5495ns | 5605ns | 6599ns | 6599ns |
Add 64x64 (Func) | 13000ns | 2331200ns | 17451ns | 16300ns | 41900ns | 60700ns |
Add 100x100 (Func) | 30800ns | 512800ns | 40269ns | 38200ns | 105700ns | 218300ns |
Loops
So the first way I think we can improve is loops. Functions have overhead, so if we remove that and get a little more imperative we can be a bit faster.
export function addMatrixLoop(a, b) {
  const out = [];
  for (let row = 0; row < a.length; row++) {
    const arrayRow = [];
    for (let col = 0; col < a[0].length; col++) {
      arrayRow.push(a[row][col] + b[row][col]);
    }
    out.push(arrayRow);
  }
  return out;
}
Note that I'm not going to do strict bounds checking; we're just assuming a and b are the same size, since bounds checks only add overhead.
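If you did want the guard, a minimal sketch might look like this (assertSameShape is hypothetical, not part of the benchmarked code):

function assertSameShape(a, b) {
  // only checks the first row; strict checking would verify every row
  if (a.length !== b.length || a[0].length !== b[0].length) {
    throw new RangeError("matrices must be the same shape");
  }
}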
Name | min | max | avg | p75 | p99 | p995 |
---|---|---|---|---|---|---|
Add 1x1 (Loop) | 28ns | 210ns | 46ns | 47ns | 142ns | 168ns |
Add 2x2 (Loop) | 55ns | 163ns | 71ns | 76ns | 125ns | 143ns |
Add 4x4 (Loop) | 122ns | 227ns | 143ns | 151ns | 195ns | 225ns |
Add 8x8 (Loop) | 360ns | 807ns | 411ns | 422ns | 744ns | 807ns |
Add 16x16 (Loop) | 1179ns | 1246ns | 1208ns | 1217ns | 1246ns | 1246ns |
Add 32x32 (Loop) | 5031ns | 5216ns | 5090ns | 5105ns | 5216ns | 5216ns |
Add 64x64 (Loop) | 14300ns | 362400ns | 20651ns | 19200ns | 52900ns | 110500ns |
Add 100x100 (Loop) | 38200ns | 425400ns | 54401ns | 54100ns | 227700ns | 256300ns |
Loops start out faster, but around 32x32 they're equal to .map, and beyond that .map is faster. Very surprising!
Pre-allocating arrays
My next idea was to pre-allocate the arrays, since pushing into an array can cause resizes and maybe those resizes are why this is slower.
export function addMatrixLoopPreAlloc(a, b) {
  const out = new Array(a.length);
  for (let row = 0; row < a.length; row++) {
    const arrayRow = new Array(a[0].length);
    for (let col = 0; col < a[0].length; col++) {
      arrayRow[col] = a[row][col] + b[row][col];
    }
    out[row] = arrayRow;
  }
  return out;
}
Name | min | max | avg | p75 | p99 | p995 |
---|---|---|---|---|---|---|
Add 1x1 (Loop Prealloc) | 13ns | 137ns | 18ns | 20ns | 56ns | 73ns |
Add 2x2 (Loop Prealloc) | 25ns | 65ns | 28ns | 27ns | 45ns | 53ns |
Add 4x4 (Loop Prealloc) | 61ns | 152ns | 73ns | 78ns | 124ns | 129ns |
Add 8x8 (Loop Prealloc) | 203ns | 444ns | 228ns | 232ns | 348ns | 434ns |
Add 16x16 (Loop Prealloc) | 710ns | 942ns | 762ns | 768ns | 942ns | 942ns |
Add 32x32 (Loop Prealloc) | 2648ns | 2769ns | 2700ns | 2716ns | 2769ns | 2769ns |
Add 64x64 (Loop Prealloc) | 9500ns | 372100ns | 10926ns | 10100ns | 25000ns | 35800ns |
Add 100x100 (Loop Prealloc) | 24500ns | 515800ns | 28392ns | 26300ns | 62100ns | 204400ns |
That did the trick! We're about 1.5x faster than where we started at large sizes, and several times faster at small ones!
Unrolling the loops
What if we just removed all the loops and wrote it out long-hand?
export function addMatrix4x4(a, b) {
  return [
    [a[0][0] + b[0][0], a[0][1] + b[0][1], a[0][2] + b[0][2], a[0][3] + b[0][3]],
    [a[1][0] + b[1][0], a[1][1] + b[1][1], a[1][2] + b[1][2], a[1][3] + b[1][3]],
    [a[2][0] + b[2][0], a[2][1] + b[2][1], a[2][2] + b[2][2], a[2][3] + b[2][3]],
    [a[3][0] + b[3][0], a[3][1] + b[3][1], a[3][2] + b[3][2], a[3][3] + b[3][3]]
  ];
}
This is not very flexible, as you need a function for every shape of matrix you want to add. However, in some cases like 3D graphics this isn't too bad because you deal with a very finite set of shapes, usually only ever 4x4. In machine learning it could cause problems.
Here's a function that generates the JavaScript source text for the unrolled loop:
export function genMatAddBody(rows, cols) {
  let funcBody = "return [\n";
  for (let r = 0; r < rows; r++) {
    funcBody += "\t\t[";
    for (let c = 0; c < cols; c++) {
      funcBody += `a[${r}][${c}] + b[${r}][${c}]${c < cols - 1 ? ", " : ""}`;
    }
    funcBody += `]${r < rows - 1 ? ", " : ""}\n`;
  }
  funcBody += `\t];\n`;
  return funcBody;
}
I was also curious if making this dynamic generation changes much:
export function genMatAddFunc(rows, cols) {
  rows = Number(rows); // prevents code injection
  cols = Number(cols);
  const body = genMatAddBody(rows, cols);
  return new Function("a", "b", body);
}
Since new Function is effectively eval, we should make sure to sanitize the input.
const addMatrix1x1Dyn = genMatAddFunc(1,1);
const addMatrix2x2Dyn = genMatAddFunc(2,2);
const addMatrix4x4Dyn = genMatAddFunc(4,4);
// etc.
const addMatrix100x100Dyn = genMatAddFunc(100,100);
Name | min | max | avg | p75 | p99 | p995 |
---|---|---|---|---|---|---|
Add 1x1 (unrolled) | 7ns | 34ns | 8ns | 8ns | 19ns | 20ns |
Add 1x1 (unrolled dynamic) | 7ns | 40ns | 8ns | 7ns | 19ns | 20ns |
Add 2x2 (unrolled) | 11ns | 46ns | 13ns | 12ns | 26ns | 29ns |
Add 2x2 (unrolled dynamic) | 11ns | 39ns | 12ns | 12ns | 27ns | 29ns |
Add 4x4 (unrolled) | 36ns | 159ns | 59ns | 72ns | 124ns | 130ns |
Add 4x4 (unrolled dynamic) | 36ns | 236ns | 67ns | 84ns | 156ns | 181ns |
Add 8x8 (unrolled) | 92ns | 243ns | 130ns | 142ns | 235ns | 242ns |
Add 8x8 (unrolled dynamic) | 89ns | 262ns | 113ns | 119ns | 186ns | 209ns |
Add 16x16 (unrolled) | 500ns | 672800ns | 734ns | 600ns | 3400ns | 10500ns |
Add 16x16 (unrolled dynamic) | 500ns | 2052000ns | 799ns | 600ns | 6400ns | 10600ns |
Add 32x32 (unrolled) | 73800ns | 562500ns | 83976ns | 85200ns | 136400ns | 160600ns |
Add 32x32 (unrolled dynamic) | 73000ns | 908200ns | 90772ns | 90900ns | 137900ns | 162600ns |
Add 64x64 (unrolled) | 328700ns | 737300ns | 350104ns | 343900ns | 574500ns | 587000ns |
Add 64x64 (unrolled dynamic) | 327600ns | 698800ns | 349201ns | 345400ns | 573900ns | 592400ns |
Add 100x100 (unrolled) | 829600ns | 1250900ns | 876580ns | 873700ns | 1143900ns | 1157500ns |
Add 100x100 (unrolled dynamic) | 816900ns | 1416300ns | 891844ns | 894500ns | 1227700ns | 1288200ns |
It's a big improvement for small sizes, beating the pre-allocated loop by about 1.5 to 2x, but bad for large ones, where it's considerably slower. I'm not sure why that's the case; maybe it has to do with the size of the function itself, since the generated code is massive. Also, dynamic generation performs basically the same as writing the functions out by hand, so if you want to save payload (and aren't limited by CSP) you can dynamically create these at no penalty.
Flattening the arrays
Another place where I think we can save is the arrays themselves. We technically don't need all the nested arrays, and they add some overhead to create. So now a 2x2 matrix looks like this:
[
  4, 7,
  10, 5
]
However, you now need to know the dimensions for this to work, because different rectangular shapes can have the same number of elements. So let's make it an object.
{
  shape: [2, 2],
  data: [
    4, 7,
    10, 5
  ]
}
The shape is an array rather than separate properties because we could scale this idea up to N-dimensional tensors; in fact, this is how libraries like tensorflowjs do it. For convenience, let's build some functions to convert between the formats.
export function nestedArrayToFlat(nested) {
  return {
    shape: [nested.length, nested[0].length],
    data: nested.flat(Infinity)
  };
}
export function flatToNestedArray(flat) {
  const data = new Array(flat.shape[0]);
  for (let row = 0; row < flat.shape[0]; row++) {
    const rowArray = new Array(flat.shape[1]);
    for (let col = 0; col < flat.shape[1]; col++) {
      rowArray[col] = flat.data[row * flat.shape[1] + col];
    }
    data[row] = rowArray;
  }
  return data;
}
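A quick round trip to sanity-check them:

const flat = nestedArrayToFlat([
  [4, 7],
  [10, 5]
]);
// { shape: [2, 2], data: [4, 7, 10, 5] }
flatToNestedArray(flat);
// [[4, 7], [10, 5]]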
So far, pre-allocated arrays with loops have the best general performance scaling to larger sizes, so we'll stick with that pattern for now. That also means I'll omit flat equivalents of .map and the push-based loop, since those approaches aren't winning in any category, as well as a dynamic version, because it's the same as unrolled.
export function addMatrixFlat(a, b) {
  const out = {
    shape: a.shape,
    data: new Array(a.data.length)
  };
  for (let row = 0; row < a.shape[0]; row++) {
    for (let col = 0; col < a.shape[1]; col++) {
      const index = (row * a.shape[1]) + col;
      out.data[index] = a.data[index] + b.data[index];
    }
  }
  return out;
}
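For example:

addMatrixFlat(
  { shape: [2, 2], data: [1, 2, 3, 4] },
  { shape: [2, 2], data: [5, 6, 7, 8] }
);
// { shape: [2, 2], data: [6, 8, 10, 12] }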
Name | min | max | avg | p75 | p99 | p995 |
---|---|---|---|---|---|---|
Add 1x1 (flat) | 9ns | 53ns | 10ns | 10ns | 24ns | 29ns |
Add 2x2 (flat) | 14ns | 49ns | 15ns | 15ns | 29ns | 30ns |
Add 4x4 (flat) | 32ns | 107ns | 40ns | 46ns | 86ns | 94ns |
Add 8x8 (flat) | 97ns | 167ns | 110ns | 113ns | 143ns | 157ns |
Add 16x16 (flat) | 400ns | 548ns | 436ns | 447ns | 517ns | 548ns |
Add 32x32 (flat) | 1985ns | 2900ns | 2222ns | 2276ns | 2900ns | 2900ns |
Add 64x64 (flat) | 8512ns | 10514ns | 8775ns | 8715ns | 10514ns | 10514ns |
Add 100x100 (flat) | 15500ns | 701100ns | 23261ns | 21800ns | 54200ns | 194800ns |
It's about 20% faster on larger matrices than our previous best but 20% slower on 1x1 and 2x2 than unrolled. Since those aren't too important I'd say this is a big win.
Row vs column major
Does it matter if we traverse over rows versus columns? One might suspect it could once CPU caching and such get involved, but let's test.
export function addMatrixFlatColMajor(a, b) {
  const out = {
    shape: a.shape,
    data: new Array(a.data.length)
  };
  for (let col = 0; col < a.shape[1]; col++) {
    for (let row = 0; row < a.shape[0]; row++) {
      const index = (row * a.shape[1]) + col;
      out.data[index] = a.data[index] + b.data[index];
    }
  }
  return out;
}
Name | min | max | avg | p75 | p99 | p995 |
---|---|---|---|---|---|---|
Add 1x1 (flat col major) | 9ns | 41ns | 10ns | 9ns | 21ns | 22ns |
Add 2x2 (flat col major) | 14ns | 41ns | 15ns | 14ns | 29ns | 32ns |
Add 4x4 (flat col major) | 32ns | 79ns | 37ns | 37ns | 61ns | 67ns |
Add 8x8 (flat col major) | 101ns | 156ns | 114ns | 116ns | 147ns | 153ns |
Add 16x16 (flat col major) | 423ns | 532ns | 453ns | 465ns | 513ns | 532ns |
Add 32x32 (flat col major) | 2047ns | 3228ns | 2199ns | 2258ns | 3228ns | 3228ns |
Add 64x64 (flat col major) | 7500ns | 413800ns | 10417ns | 10200ns | 26200ns | 37000ns |
Add 100x100 (flat col major) | 19800ns | 575300ns | 25090ns | 23500ns | 63000ns | 198500ns |
As it turns out, column-major traversal is actually a tiny bit slower than row-major traversal. This is likely because row-major order walks memory sequentially, so cache lines are read more optimally.
However, since element-wise add is so simple, we can actually ditch the loop structure and just add all the elements linearly with a single loop.
export function addMatrixFlatSimple(a, b) {
  const out = {
    shape: a.shape,
    data: new Array(a.data.length)
  };
  for (let i = 0; i < a.data.length; i++) {
    out.data[i] = a.data[i] + b.data[i];
  }
  return out;
}
Name | min | max | avg | p75 | p99 | p995 |
---|---|---|---|---|---|---|
Add 1x1 (flat simple) | 7ns | 46ns | 8ns | 8ns | 18ns | 20ns |
Add 2x2 (flat simple) | 9ns | 54ns | 10ns | 10ns | 23ns | 26ns |
Add 4x4 (flat simple) | 18ns | 77ns | 24ns | 28ns | 51ns | 56ns |
Add 8x8 (flat simple) | 55ns | 159ns | 73ns | 78ns | 125ns | 136ns |
Add 16x16 (flat simple) | 276ns | 405ns | 315ns | 335ns | 393ns | 405ns |
Add 32x32 (flat simple) | 1387ns | 1682ns | 1490ns | 1547ns | 1682ns | 1682ns |
Add 64x64 (flat simple) | 6381ns | 7219ns | 6602ns | 6675ns | 7219ns | 7219ns |
Add 100x100 (flat simple) | 9000ns | 598000ns | 17166ns | 15700ns | 49400ns | 178400ns |
This is another 20%+ faster across the board.
Unrolled
We can unroll these too and see what happens; maybe the simpler structure helps? Using this code:
export function genMatAddFlatBody(rows, cols) {
  let funcBody = "return [\n";
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      // comma after every element except the last one
      funcBody += `a[${r * cols + c}] + b[${r * cols + c}]${(r * cols + c) < rows * cols - 1 ? ", " : ""}`;
    }
  }
  funcBody += `];\n`;
  return funcBody;
}
We can generate functions like this:
export function addMatrixFlat2x2(a, b) {
  return [
    a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]];
}
And we can dynamically create them with new Function like this:
export function genMatAddFlatFunc(rows, cols) {
  rows = Number(rows);
  cols = Number(cols);
  const body = genMatAddFlatBody(rows, cols);
  return new Function("a", "b", body);
}
Name | min | max | avg | p75 | p99 | p995 |
---|---|---|---|---|---|---|
Add 1x1 (flat unrolled) | 6ns | 53ns | 7ns | 7ns | 19ns | 22ns |
Add 2x2 (flat unrolled) | 7ns | 62ns | 8ns | 8ns | 21ns | 23ns |
Add 4x4 (flat unrolled) | 24ns | 136ns | 37ns | 41ns | 84ns | 93ns |
Add 8x8 (flat unrolled) | 61ns | 185ns | 81ns | 86ns | 131ns | 144ns |
Add 16x16 (flat unrolled) | 300ns | 564700ns | 508ns | 400ns | 1000ns | 6100ns |
Add 32x32 (flat unrolled) | 63600ns | 826700ns | 74574ns | 75200ns | 133000ns | 162600ns |
Add 64x64 (flat unrolled) | 263500ns | 788800ns | 286503ns | 280600ns | 502900ns | 528900ns |
Add 100x100 (flat unrolled) | 706400ns | 1760300ns | 764369ns | 758900ns | 1102800ns | 1118900ns |
It just edges out the simple loop at 1x1 and 2x2; beyond that it loses, and gets much, much worse at large sizes.
Typed arrays
So the next area of possible optimization I can see is to actually use types, which we can do in JavaScript with typed arrays. These let us allocate a flat block of memory and drop the overhead of the array structure entirely.

There's another reason this matters: typed arrays can reduce conversions. APIs like WASM, WebGL and WebGPU deal with blocks of memory, and the less we have to convert, the faster we'll wind up overall. So even if this turns out a little slower, there are still good reasons to pursue it. We do end up with different paths, one for floats and one for integers, and even then the bit-width we choose might matter. Since we've already shown that flat structures perform better overall, we don't need to consider nested arrays, and for brevity I'm not going to test every typed array combo, as we'll start to see a general pattern.
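The benchmarked functions aren't shown here; a minimal sketch of what the F64 variant presumably looks like, following the flat simple loop above (swap in Float32Array or Int32Array for the other variants):

export function addMatrixF64(a, b) {
  const out = {
    shape: a.shape,
    // one flat typed block instead of a normal array
    data: new Float64Array(a.data.length)
  };
  for (let i = 0; i < a.data.length; i++) {
    out.data[i] = a.data[i] + b.data[i];
  }
  return out;
}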
Float 64
Name | min | max | avg | p75 | p99 | p995 |
---|---|---|---|---|---|---|
Add 1x1 (F64) | 330ns | 1600ns | 400ns | 397ns | 663ns | 1600ns |
Add 2x2 (F64) | 329ns | 598ns | 393ns | 409ns | 493ns | 598ns |
Add 4x4 (F64) | 393ns | 1786ns | 490ns | 503ns | 662ns | 1786ns |
Add 8x8 (F64) | 490ns | 778ns | 621ns | 664ns | 778ns | 778ns |
Add 16x16 (F64) | 1024ns | 5425ns | 1311ns | 1334ns | 5425ns | 5425ns |
Add 32x32 (F64) | 3346ns | 4707ns | 3772ns | 4115ns | 4707ns | 4707ns |
Add 64x64 (F64) | 8000ns | 2309700ns | 14203ns | 12700ns | 35300ns | 44800ns |
Add 100x100 (F64) | 23200ns | 3328400ns | 35026ns | 33300ns | 82400ns | 231000ns |
JavaScript numbers are float 64s, so it's really surprising that these behave slower than a normal JavaScript array; small matrices are even slower than array.map. I'm guessing this has something to do with how the engine treats them. As the matrices get larger these get relatively faster, but even at 100x100 it's still quite a bit slower than a normal flat array.
Float 32
Name | min | max | avg | p75 | p99 | p995 |
---|---|---|---|---|---|---|
Add 1x1 (F32) | 324ns | 554ns | 380ns | 391ns | 506ns | 554ns |
Add 2x2 (F32) | 324ns | 594ns | 391ns | 408ns | 520ns | 594ns |
Add 4x4 (F32) | 396ns | 658ns | 463ns | 489ns | 569ns | 658ns |
Add 8x8 (F32) | 508ns | 822ns | 620ns | 673ns | 822ns | 822ns |
Add 16x16 (F32) | 1148ns | 1784ns | 1345ns | 1422ns | 1784ns | 1784ns |
Add 32x32 (F32) | 3258ns | 3840ns | 3344ns | 3337ns | 3840ns | 3840ns |
Add 64x64 (F32) | 10500ns | 1101800ns | 18473ns | 21600ns | 66500ns | 101200ns |
Add 100x100 (F32) | 25800ns | 1797500ns | 37062ns | 35800ns | 99800ns | 245400ns |
F32 arrays have the same problem as F64s. The performance is near identical despite the smaller elements, so for pure speed there's no point in choosing them; in fact, at 100x100, F64 arrays are decently faster. The only benefit is halving our memory, which might still be a reason to choose them.
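One caveat if you do pick F32 for the memory savings: JS numbers are 64-bit floats, so every value rounds through 32 bits on write and you silently lose precision.

const f32 = new Float32Array([0.1]);
f32[0] === 0.1; // false
f32[0];         // 0.10000000149011612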
Int 32
Name | min | max | avg | p75 | p99 | p995 |
---|---|---|---|---|---|---|
Add 1x1 (I32) | 321ns | 1015ns | 390ns | 398ns | 704ns | 1015ns |
Add 2x2 (I32) | 324ns | 570ns | 390ns | 403ns | 501ns | 570ns |
Add 4x4 (I32) | 372ns | 530ns | 426ns | 443ns | 488ns | 530ns |
Add 8x8 (I32) | 455ns | 621ns | 539ns | 575ns | 616ns | 621ns |
Add 16x16 (I32) | 784ns | 1202ns | 913ns | 966ns | 1202ns | 1202ns |
Add 32x32 (I32) | 2111ns | 2704ns | 2182ns | 2182ns | 2704ns | 2704ns |
Add 64x64 (I32) | 8742ns | 9569ns | 9138ns | 9305ns | 9569ns | 9569ns |
Add 100x100 (I32) | 12600ns | 2578300ns | 22470ns | 21600ns | 50300ns | 72200ns |
I32s again behave similarly but see much bigger gains at larger matrices. In fact, at 100x100 the I32 matrix is about equal to a flat array. Not amazing, but if you're dealing with large integer matrices this is probably your best choice.
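Also keep in mind that Int32Array coerces on write: fractions are truncated and anything outside 32 bits wraps around.

const i32 = new Int32Array(2);
i32[0] = 2.7;     // stored as 2
i32[1] = 2 ** 31; // wraps to -2147483648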
Conclusion
For simple, single-threaded JavaScript we've observed a few things (in Deno/V8 @ 2023-03-31):
- Loops will mostly perform better than .map, except at very large sizes and only with nested arrays (I tried a flat array version and it wasn't notable enough to copy-paste the data).
- Bespoke unrolled functions work well at very small sizes, 4x4 or less, but don't beat a simple loop and fall off very, very quickly.
- Reducing structure makes a big difference.
- Pre-allocating arrays makes a huge difference; always do this if you can.
- Typed arrays provide no speed advantage (but we might get less conversion overhead and some space savings).
There are more ways we could deal with matrices. I'd like to look at what WASM and WebGPU can do: high overhead, but potentially massive speed increases in the actual calculation due to parallelism. Web Workers as well. Different ops can also vary widely; matrix multiplication uses the left-hand and right-hand structures differently and might require different strategies. But I think the biggest take-away is:
Your best bet for a generalized element-wise matrix op is a single flat loop over a normal JS array as it's fast and scales well
The Data
Name | min | max | avg | p75 | p99 | p995 |
---|---|---|---|---|---|---|
Add 1x1 (Func) | 63ns | 180ns | 70ns | 74ns | 113ns | 124ns |
Add 1x1 (Loop) | 28ns | 210ns | 46ns | 47ns | 142ns | 168ns |
Add 1x1 (Loop Prealloc) | 13ns | 137ns | 18ns | 20ns | 56ns | 73ns |
Add 1x1 (unrolled) | 7ns | 34ns | 8ns | 8ns | 19ns | 20ns |
Add 1x1 (unrolled dynamic) | 7ns | 40ns | 8ns | 7ns | 19ns | 20ns |
Add 1x1 (flat) | 9ns | 53ns | 10ns | 10ns | 24ns | 29ns |
Add 1x1 (flat col major) | 9ns | 41ns | 10ns | 9ns | 21ns | 22ns |
Add 1x1 (flat simple) | 7ns | 46ns | 8ns | 8ns | 18ns | 20ns |
Add 1x1 (flat unrolled) | 6ns | 53ns | 7ns | 7ns | 19ns | 22ns |
Add 1x1 (F64) | 330ns | 1600ns | 400ns | 397ns | 663ns | 1600ns |
Add 1x1 (F32) | 324ns | 554ns | 380ns | 391ns | 506ns | 554ns |
Add 1x1 (I32) | 321ns | 1015ns | 390ns | 398ns | 704ns | 1015ns |
Add 2x2 (Func) | 144ns | 208ns | 152ns | 158ns | 184ns | 196ns |
Add 2x2 (Loop) | 55ns | 163ns | 71ns | 76ns | 125ns | 143ns |
Add 2x2 (Loop Prealloc) | 25ns | 65ns | 28ns | 27ns | 45ns | 53ns |
Add 2x2 (unrolled) | 11ns | 46ns | 13ns | 12ns | 26ns | 29ns |
Add 2x2 (unrolled dynamic) | 11ns | 39ns | 12ns | 12ns | 27ns | 29ns |
Add 2x2 (flat) | 14ns | 49ns | 15ns | 15ns | 29ns | 30ns |
Add 2x2 (flat col major) | 14ns | 41ns | 15ns | 14ns | 29ns | 32ns |
Add 2x2 (flat simple) | 9ns | 54ns | 10ns | 10ns | 23ns | 26ns |
Add 2x2 (flat unrolled) | 7ns | 62ns | 8ns | 8ns | 21ns | 23ns |
Add 2x2 (F64) | 329ns | 598ns | 393ns | 409ns | 493ns | 598ns |
Add 2x2 (F32) | 324ns | 594ns | 391ns | 408ns | 520ns | 594ns |
Add 2x2 (I32) | 324ns | 570ns | 390ns | 403ns | 501ns | 570ns |
Add 4x4 (Func) | 312ns | 373ns | 329ns | 335ns | 370ns | 373ns |
Add 4x4 (Loop) | 122ns | 227ns | 143ns | 151ns | 195ns | 225ns |
Add 4x4 (Loop Prealloc) | 61ns | 152ns | 73ns | 78ns | 124ns | 129ns |
Add 4x4 (unrolled) | 36ns | 159ns | 59ns | 72ns | 124ns | 130ns |
Add 4x4 (unrolled dynamic) | 36ns | 236ns | 67ns | 84ns | 156ns | 181ns |
Add 4x4 (flat) | 32ns | 107ns | 40ns | 46ns | 86ns | 94ns |
Add 4x4 (flat col major) | 32ns | 79ns | 37ns | 37ns | 61ns | 67ns |
Add 4x4 (flat simple) | 18ns | 77ns | 24ns | 28ns | 51ns | 56ns |
Add 4x4 (flat unrolled) | 24ns | 136ns | 37ns | 41ns | 84ns | 93ns |
Add 4x4 (F64) | 393ns | 1786ns | 490ns | 503ns | 662ns | 1786ns |
Add 4x4 (F32) | 396ns | 658ns | 463ns | 489ns | 569ns | 658ns |
Add 4x4 (I32) | 372ns | 530ns | 426ns | 443ns | 488ns | 530ns |
Add 8x8 (Func) | 694ns | 930ns | 724ns | 731ns | 930ns | 930ns |
Add 8x8 (Loop) | 360ns | 807ns | 411ns | 422ns | 744ns | 807ns |
Add 8x8 (Loop Prealloc) | 203ns | 444ns | 228ns | 232ns | 348ns | 434ns |
Add 8x8 (unrolled) | 92ns | 243ns | 130ns | 142ns | 235ns | 242ns |
Add 8x8 (unrolled dynamic) | 89ns | 262ns | 113ns | 119ns | 186ns | 209ns |
Add 8x8 (flat) | 97ns | 167ns | 110ns | 113ns | 143ns | 157ns |
Add 8x8 (flat col major) | 101ns | 156ns | 114ns | 116ns | 147ns | 153ns |
Add 8x8 (flat simple) | 55ns | 159ns | 73ns | 78ns | 125ns | 136ns |
Add 8x8 (flat unrolled) | 61ns | 185ns | 81ns | 86ns | 131ns | 144ns |
Add 8x8 (F64) | 490ns | 778ns | 621ns | 664ns | 778ns | 778ns |
Add 8x8 (F32) | 508ns | 822ns | 620ns | 673ns | 822ns | 822ns |
Add 8x8 (I32) | 455ns | 621ns | 539ns | 575ns | 616ns | 621ns |
Add 16x16 (Func) | 1798ns | 1942ns | 1836ns | 1843ns | 1942ns | 1942ns |
Add 16x16 (Loop) | 1179ns | 1246ns | 1208ns | 1217ns | 1246ns | 1246ns |
Add 16x16 (Loop Prealloc) | 710ns | 942ns | 762ns | 768ns | 942ns | 942ns |
Add 16x16 (unrolled) | 500ns | 672800ns | 734ns | 600ns | 3400ns | 10500ns |
Add 16x16 (unrolled dynamic) | 500ns | 2052000ns | 799ns | 600ns | 6400ns | 10600ns |
Add 16x16 (flat) | 400ns | 548ns | 436ns | 447ns | 517ns | 548ns |
Add 16x16 (flat col major) | 423ns | 532ns | 453ns | 465ns | 513ns | 532ns |
Add 16x16 (flat simple) | 276ns | 405ns | 315ns | 335ns | 393ns | 405ns |
Add 16x16 (flat unrolled) | 300ns | 564700ns | 508ns | 400ns | 1000ns | 6100ns |
Add 16x16 (F64) | 1024ns | 5425ns | 1311ns | 1334ns | 5425ns | 5425ns |
Add 16x16 (F32) | 1148ns | 1784ns | 1345ns | 1422ns | 1784ns | 1784ns |
Add 16x16 (I32) | 784ns | 1202ns | 913ns | 966ns | 1202ns | 1202ns |
Add 32x32 (Func) | 5274ns | 6599ns | 5495ns | 5605ns | 6599ns | 6599ns |
Add 32x32 (Loop) | 5031ns | 5216ns | 5090ns | 5105ns | 5216ns | 5216ns |
Add 32x32 (Loop Prealloc) | 2648ns | 2769ns | 2700ns | 2716ns | 2769ns | 2769ns |
Add 32x32 (unrolled) | 73800ns | 562500ns | 83976ns | 85200ns | 136400ns | 160600ns |
Add 32x32 (unrolled dynamic) | 73000ns | 908200ns | 90772ns | 90900ns | 137900ns | 162600ns |
Add 32x32 (flat) | 1985ns | 2900ns | 2222ns | 2276ns | 2900ns | 2900ns |
Add 32x32 (flat col major) | 2047ns | 3228ns | 2199ns | 2258ns | 3228ns | 3228ns |
Add 32x32 (flat simple) | 1387ns | 1682ns | 1490ns | 1547ns | 1682ns | 1682ns |
Add 32x32 (flat unrolled) | 63600ns | 826700ns | 74574ns | 75200ns | 133000ns | 162600ns |
Add 32x32 (F64) | 3346ns | 4707ns | 3772ns | 4115ns | 4707ns | 4707ns |
Add 32x32 (F32) | 3258ns | 3840ns | 3344ns | 3337ns | 3840ns | 3840ns |
Add 32x32 (I32) | 2111ns | 2704ns | 2182ns | 2182ns | 2704ns | 2704ns |
Add 64x64 (Func) | 13000ns | 2331200ns | 17451ns | 16300ns | 41900ns | 60700ns |
Add 64x64 (Loop) | 14300ns | 362400ns | 20651ns | 19200ns | 52900ns | 110500ns |
Add 64x64 (Loop Prealloc) | 9500ns | 372100ns | 10926ns | 10100ns | 25000ns | 35800ns |
Add 64x64 (unrolled) | 328700ns | 737300ns | 350104ns | 343900ns | 574500ns | 587000ns |
Add 64x64 (unrolled dynamic) | 327600ns | 698800ns | 349201ns | 345400ns | 573900ns | 592400ns |
Add 64x64 (flat) | 8512ns | 10514ns | 8775ns | 8715ns | 10514ns | 10514ns |
Add 64x64 (flat col major) | 7500ns | 413800ns | 10417ns | 10200ns | 26200ns | 37000ns |
Add 64x64 (flat simple) | 6381ns | 7219ns | 6602ns | 6675ns | 7219ns | 7219ns |
Add 64x64 (flat unrolled) | 263500ns | 788800ns | 286503ns | 280600ns | 502900ns | 528900ns |
Add 64x64 (F64) | 8000ns | 2309700ns | 14203ns | 12700ns | 35300ns | 44800ns |
Add 64x64 (F32) | 10500ns | 1101800ns | 18473ns | 21600ns | 66500ns | 101200ns |
Add 64x64 (I32) | 8742ns | 9569ns | 9138ns | 9305ns | 9569ns | 9569ns |
Add 100x100 (Func) | 30800ns | 512800ns | 40269ns | 38200ns | 105700ns | 218300ns |
Add 100x100 (Loop) | 38200ns | 425400ns | 54401ns | 54100ns | 227700ns | 256300ns |
Add 100x100 (Loop Prealloc) | 24500ns | 515800ns | 28392ns | 26300ns | 62100ns | 204400ns |
Add 100x100 (unrolled) | 829600ns | 1250900ns | 876580ns | 873700ns | 1143900ns | 1157500ns |
Add 100x100 (unrolled dynamic) | 816900ns | 1416300ns | 891844ns | 894500ns | 1227700ns | 1288200ns |
Add 100x100 (flat) | 15500ns | 701100ns | 23261ns | 21800ns | 54200ns | 194800ns |
Add 100x100 (flat col major) | 19800ns | 575300ns | 25090ns | 23500ns | 63000ns | 198500ns |
Add 100x100 (flat simple) | 9000ns | 598000ns | 17166ns | 15700ns | 49400ns | 178400ns |
Add 100x100 (flat unrolled) | 706400ns | 1760300ns | 764369ns | 758900ns | 1102800ns | 1118900ns |
Add 100x100 (F64) | 23200ns | 3328400ns | 35026ns | 33300ns | 82400ns | 231000ns |
Add 100x100 (F32) | 25800ns | 1797500ns | 37062ns | 35800ns | 99800ns | 245400ns |
Add 100x100 (I32) | 12600ns | 2578300ns | 22470ns | 21600ns | 50300ns | 72200ns |