When Go 1.26 was released, the team shipped something interesting under the experimental tag:
SIMD: Single Instruction, Multiple Data.
The problem modern CPUs face is that clock rates have stagnated.
So consider the scalar approach to executing an instruction. Take float multiplication, for example:
we take two numbers and multiply them, which is fine and dandy.
But hardware designers long ago (literally) asked a question: "What if we could widen the operation, so developers could pack more data into one instruction?"
And they did!
We got:

- XMM0–XMM15, which are 128 bits wide
- YMM0–YMM15, which are 256 bits wide
- ZMM0–ZMM31, which are 512 bits wide
Let's take 256 bits, for example.
A float is 4 bytes == 32 bits, which means we can fit 8 floats into one 256-bit register, right?
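To make that lane math concrete, here is a plain-Go sketch (no SIMD package needed) of what a single 256-bit multiply does: eight float32 lanes, each multiplied independently. Scalar code does this as 8 separate multiplications; on AVX hardware one packed instruction (VMULPS) handles all 8 lanes at once.

```go
package main

import "fmt"

func main() {
	// A 256-bit YMM register holds 256 / 32 = 8 float32 lanes.
	const lanes = 256 / 32

	a := [lanes]float32{1, 2, 3, 4, 5, 6, 7, 8}
	b := [lanes]float32{2, 2, 2, 2, 2, 2, 2, 2}

	// Scalar view: 8 separate multiplications.
	// SIMD view: one VMULPS ymm, ymm, ymm does all 8 at once.
	var c [lanes]float32
	for i := 0; i < lanes; i++ {
		c[i] = a[i] * b[i]
	}
	fmt.Println(c) // [2 4 6 8 10 12 14 16]
}
```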
SIMD shines on data-parallel workloads — operations where you apply the same transformation to every element of a large array.
Image processing, audio DSP, physics simulations, matrix math, cryptography, text search (finding bytes in a buffer), and vector similarity search all fit this pattern.
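As a tiny illustration of the pattern (same operation applied to every element, no cross-element dependencies), here is a hypothetical scalar gain loop from the audio DSP category. The function name and values are mine, not from any benchmark in this post; the point is that each iteration is independent, so a compiler or hand-written SIMD kernel can process several samples per instruction.

```go
package main

import "fmt"

// applyGain multiplies every sample by the same factor: a textbook
// data-parallel loop. No iteration depends on another, so groups of
// samples could be handled by one packed SIMD multiply.
func applyGain(samples []float32, gain float32) {
	for i := range samples {
		samples[i] *= gain
	}
}

func main() {
	s := []float32{0.5, -1.0, 0.25, 2.0}
	applyGain(s, 0.5)
	fmt.Println(s) // [0.25 -0.5 0.125 1]
}
```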
The Go team announced fantastic news: they are working on a SIMD standard library package, and they have put it out for testing behind an experimental flag.
It works only on the amd64 architecture for now.
I prepared a short demonstration using XOR, an operation heavily used in cryptography.
SIMD in Go — Cryptographic XOR Benchmark
Demonstrates Go 1.26's simd/archsimd package applied to the core operation behind
stream ciphers (AES-CTR, ChaCha20, OTP): XOR of a plaintext buffer with a keystream.
What's benchmarked
| Implementation | Strategy | Instructions |
|---|---|---|
| `XORScalar` | Byte-by-byte loop | `XOR r8, r8` |
| `XORSimd256` | 32 bytes/iteration via AVX2 | `VPXOR ymm, ymm, ymm` |
| `XORSimd256Unrolled` | 128 bytes/iteration (4× unrolled) | 4× `VPXOR ymm` per iter |
Run (macOS with Docker):

```shell
./run.sh
```
Expected speedup
On x86_64 (emulated via Docker on Apple Silicon, real numbers will be higher on native):
- Small payloads (256B): ~5-10× faster with SIMD
- Large payloads (1MB+): ~15-25× faster with SIMD (memory-bandwidth limited)
On native x86 hardware (e.g. Emerald Rapids Xeon), expect even better numbers since VPXOR has 1-cycle latency and can retire 3 per cycle on ports p0/p1/p5.
uops.info reference
For the presentation, show the Emerald Rapids measurements for:
- VPXOR (YMM, YMM, YMM), the instruction behind `Uint8x32.Xor()` (link)
- VAESENC (XMM, XMM,…
First, let's check out what the scalar operation looks like:
```go
func XORScalar(destination, plaintext, keystream []byte) {
	for i := range plaintext {
		destination[i] = plaintext[i] ^ keystream[i]
	}
}
```
Pretty straightforward!
Now let's look at the SIMD-vectorized code:
```go
// XORSimd256 XORs plaintext with a keystream using 256-bit (32-byte)
// SIMD vectors. Each VPXOR instruction XORs 32 bytes, and with three
// execution ports each able to retire a VPXOR per cycle, the CPU can
// XOR up to 96 bytes per clock cycle with 256-bit vectors.
func XORSimd256(destination, plaintext, keystream []byte) {
	n := len(plaintext)
	i := 0
	// Process 32 bytes at a time using AVX2 VPXOR.
	for i+32 <= n {
		p := archsimd.LoadUint8x32((*[32]byte)(plaintext[i : i+32]))
		k := archsimd.LoadUint8x32((*[32]byte)(keystream[i : i+32]))
		r := p.Xor(k)
		// r holds the XOR result in a SIMD register (a CPU vector register, not memory).
		// Store() writes those 32 bytes from the register back into the destination
		// slice in main memory. Without it, the result would be discarded when the
		// register gets reused.
		r.Store((*[32]byte)(destination[i : i+32]))
		i += 32
	}
	// Handle remaining bytes (< 32) with a scalar fallback.
	for ; i < n; i++ {
		destination[i] = plaintext[i] ^ keystream[i]
	}
}
```
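The table above also lists an `XORSimd256Unrolled` variant doing 128 bytes per iteration, whose source isn't shown here. As a portable stand-in for the idea, here is a word-at-a-time sketch with the same structure: four independent XORs per loop iteration (here 8-byte `uint64` XORs, so 32 bytes/iteration, versus 4× 32-byte VPXOR in the real thing). This is my plain-Go illustration, not the benchmark's archsimd code; the point is that the four XORs have no data dependencies on each other, so a superscalar CPU can keep them all in flight while paying the loop bookkeeping only once.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// xorUnrolled XORs plaintext with keystream into destination,
// 32 bytes (4 independent uint64 XORs) per iteration.
func xorUnrolled(destination, plaintext, keystream []byte) {
	n := len(plaintext)
	i := 0
	for i+32 <= n {
		p0 := binary.LittleEndian.Uint64(plaintext[i:])
		p1 := binary.LittleEndian.Uint64(plaintext[i+8:])
		p2 := binary.LittleEndian.Uint64(plaintext[i+16:])
		p3 := binary.LittleEndian.Uint64(plaintext[i+24:])
		k0 := binary.LittleEndian.Uint64(keystream[i:])
		k1 := binary.LittleEndian.Uint64(keystream[i+8:])
		k2 := binary.LittleEndian.Uint64(keystream[i+16:])
		k3 := binary.LittleEndian.Uint64(keystream[i+24:])
		binary.LittleEndian.PutUint64(destination[i:], p0^k0)
		binary.LittleEndian.PutUint64(destination[i+8:], p1^k1)
		binary.LittleEndian.PutUint64(destination[i+16:], p2^k2)
		binary.LittleEndian.PutUint64(destination[i+24:], p3^k3)
		i += 32
	}
	// Scalar tail for the remaining < 32 bytes.
	for ; i < n; i++ {
		destination[i] = plaintext[i] ^ keystream[i]
	}
}

func main() {
	plaintext := make([]byte, 100)
	keystream := make([]byte, 100)
	for i := range plaintext {
		plaintext[i] = byte(i)
		keystream[i] = byte(255 - i)
	}
	got := make([]byte, 100)
	xorUnrolled(got, plaintext, keystream)

	ok := true
	for i := range got {
		if got[i] != plaintext[i]^keystream[i] {
			ok = false
		}
	}
	fmt.Println(ok) // true
}
```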
And the results were stunning! Even though I ran it on a MacBook, where we lost some performance due to the architecture mismatch!
You can run it for yourselves and test the results!
Here are mine:
The bottom line is that the Go dev team is working on something great, and I think it's a step in the right direction. In my opinion we shouldn't need to hand-write custom for loops for this...
But for those sweet low-level optimisations where you squeeze everything out of the hardware, this is a great step forward!


