Haoren(Harrison) Hu

Rust SIMD Benchmark: std::simd vs NEON on Apple M4

A friend shared Sylvain Kerkour's post SIMD programming in pure Rust which covers AVX-512 on AMD Zen 5. That got me curious about ARM's side of the story—specifically how std::simd compares to hand-written NEON intrinsics. My main development machine is a MacBook Pro with Apple M4, so I ran the benchmarks there.

I ran 9 benchmarks comparing three approaches: scalar, std::simd, and NEON. All numbers below are averaged from 3 runs.

Key finding: std::simd ranged from 9.2x faster to 7.7x slower than scalar code. NEON always delivered a speedup (1.2x–9.0x). The difference comes down to data layout: interleaved data like RGB images and stereo audio exposes limitations in the portable SIMD abstraction.

Test Environment

  • CPU: Apple M4 (MacBook Pro 2024)
  • Rust: rustc 1.94.0-nightly (2026-01-14)
  • OS: macOS
  • Command: cargo run --release

Three implementations per scenario:

| Approach | Stability | Portability |
|---|---|---|
| Scalar | stable | everywhere |
| std::simd | nightly | cross-platform |
| std::arch (NEON) | stable | ARM64 only |
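
For context, the std::simd versions below assume a nightly toolchain with the portable SIMD feature gate enabled, roughly this prelude (my reconstruction, not copied from the repo):

```rust
// Crate root: std::simd is nightly-only and gated behind a feature flag.
#![feature(portable_simd)]

// Pulls in Simd, the type aliases (u8x32, i32x8, f32x8, ...), and the
// comparison traits (SimdPartialEq, SimdPartialOrd, SimdOrd) used throughout.
use std::simd::prelude::*;

// The NEON versions additionally import the aarch64 intrinsics:
use std::arch::aarch64::*;
```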

Results Summary

| Scenario | Scalar | std::simd | NEON | std::simd speedup | NEON speedup |
|---|---|---|---|---|---|
| RGB→Grayscale | 6.16ms | 18.51ms | 1.45ms | 0.33x | 4.26x |
| Volume Adjust | 1.39ms | 3.42ms | 0.82ms | 0.40x | 1.70x |
| Audio Mixing | 0.63ms | 4.84ms | 0.53ms | 0.13x | 1.20x |
| Count Newlines | 14.10ms | 3.22ms | 3.87ms | 4.38x | 3.65x |
| Find Byte | 24.07ms | 2.61ms | 2.68ms | 9.23x | 9.00x |
| Dot Product | 51.20ms | 10.99ms | 20.14ms | 4.66x | 2.54x |
| Matrix-Vec Mul | 4.48ms | 0.69ms | 1.35ms | 6.53x | 3.32x |
| Range Check | 24.93ms | 8.24ms | 9.33ms | 3.02x | 2.67x |
| Sorted Check | 25.14ms | 37.18ms | 6.87ms | 0.68x | 3.66x |

Three scenarios showed std::simd slower than scalar: RGB conversion (0.33x), audio mixing (0.13x), and sorted check (0.68x).


Scenario 1: RGB to Grayscale

Task: Convert 1920×1080 RGB image to grayscale using ITU-R BT.601 formula.

I used fixed-point arithmetic to avoid floating-point operations, which is much faster on integer pipelines: Gray = (77*R + 150*G + 29*B) >> 8. The weights sum to 256, so pure white (255, 255, 255) maps to (256 × 255) >> 8 = 255.

Scalar

The scalar version is straightforward with chunks_exact(3):

pub fn rgb_to_grayscale(rgb: &[u8], gray: &mut [u8]) {
    for (i, chunk) in rgb.chunks_exact(3).enumerate() {
        let r = chunk[0] as u32;
        let g = chunk[1] as u32;
        let b = chunk[2] as u32;
        gray[i] = ((77 * r + 150 * g + 29 * b) >> 8) as u8;
    }
}

std::simd

pub fn rgb_to_grayscale(rgb: &[u8], gray: &mut [u8]) {
    let chunks = rgb.chunks_exact(48);
    let weight_r = u16x16::splat(77);
    let weight_g = u16x16::splat(150);
    let weight_b = u16x16::splat(29);

    let mut out_idx = 0;
    for chunk in chunks {
        // No deinterleave instruction available
        // Must use scalar loop: 48 memory accesses per iteration
        let mut r_vals = [0u8; 16];
        let mut g_vals = [0u8; 16];
        let mut b_vals = [0u8; 16];

        for i in 0..16 {
            r_vals[i] = chunk[i * 3];
            g_vals[i] = chunk[i * 3 + 1];
            b_vals[i] = chunk[i * 3 + 2];
        }

        let r = u16x16::from_array(r_vals.map(|x| x as u16));
        let g = u16x16::from_array(g_vals.map(|x| x as u16));
        let b = u16x16::from_array(b_vals.map(|x| x as u16));

        let gray_u16 = (r * weight_r + g * weight_g + b * weight_b) >> Simd::splat(8);

        let gray_u8: [u8; 16] = gray_u16.to_array().map(|x| x as u8);
        gray[out_idx..out_idx + 16].copy_from_slice(&gray_u8);
        out_idx += 16;
    }
}

NEON

pub fn rgb_to_grayscale(rgb: &[u8], gray: &mut [u8]) {
    // Round the pixel count down to whole 16-pixel blocks
    let pixel_count = rgb.len() / 3;
    let simd_len = pixel_count - (pixel_count % 16);

    unsafe {
        let weight_r = vdupq_n_u8(77);
        let weight_g = vdupq_n_u8(150);
        let weight_b = vdupq_n_u8(29);

        for i in (0..simd_len).step_by(16) {
            // vld3q_u8: load + deinterleave in one instruction
            // [R0,G0,B0,R1,G1,B1,...] → R[], G[], B[]
            let rgb_data = vld3q_u8(rgb.as_ptr().add(i * 3));
            let r = rgb_data.0;
            let g = rgb_data.1;
            let b = rgb_data.2;

            // Widening multiply: u8 × u8 → u16
            let r_lo = vmull_u8(vget_low_u8(r), vget_low_u8(weight_r));
            let r_hi = vmull_high_u8(r, weight_r);
            let g_lo = vmull_u8(vget_low_u8(g), vget_low_u8(weight_g));
            let g_hi = vmull_high_u8(g, weight_g);
            let b_lo = vmull_u8(vget_low_u8(b), vget_low_u8(weight_b));
            let b_hi = vmull_high_u8(b, weight_b);

            let sum_lo = vaddq_u16(vaddq_u16(r_lo, g_lo), b_lo);
            let sum_hi = vaddq_u16(vaddq_u16(r_hi, g_hi), b_hi);

            // Shift right + narrow: u16 → u8
            let gray_lo = vshrn_n_u16::<8>(sum_lo);
            let gray_hi = vshrn_n_u16::<8>(sum_hi);
            let gray_vec = vcombine_u8(gray_lo, gray_hi);

            vst1q_u8(gray.as_mut_ptr().add(i), gray_vec);
        }
    }
}

Results

| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 6.16ms | 1.0x |
| std::simd | 18.51ms | 0.33x |
| NEON | 1.45ms | 4.26x |

Analysis: vld3q_u8 performs load and 3-way deinterleave in one instruction. std::simd lacks this capability, requiring a 16-iteration scalar loop that dominates execution time.
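
The practical workaround is usually to change the data layout rather than fight the abstraction. A minimal sketch, assuming the image is stored as separate R, G, B planes (rgb_planar_to_grayscale is my hypothetical variant, not code from the benchmark repo):

```rust
use std::simd::prelude::*;

// Planar layout: nothing to deinterleave, so no scalar gather loop remains.
pub fn rgb_planar_to_grayscale(r: &[u8], g: &[u8], b: &[u8], gray: &mut [u8]) {
    let (wr, wg, wb) = (u16x16::splat(77), u16x16::splat(150), u16x16::splat(29));

    for (((rc, gc), bc), out) in r
        .chunks_exact(16)
        .zip(g.chunks_exact(16))
        .zip(b.chunks_exact(16))
        .zip(gray.chunks_exact_mut(16))
    {
        // Widen u8 -> u16, apply the fixed-point weights, shift back down.
        let r16: u16x16 = u8x16::from_slice(rc).cast();
        let g16: u16x16 = u8x16::from_slice(gc).cast();
        let b16: u16x16 = u8x16::from_slice(bc).cast();
        let y = (r16 * wr + g16 * wg + b16 * wb) >> u16x16::splat(8);
        // After >>8 every lane is <= 255, so the truncating cast is safe.
        out.copy_from_slice(&y.cast::<u8>().to_array());
    }
}
```

Whether a layout change is feasible depends on the pipeline, of course; many image APIs hand you interleaved buffers.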


Scenario 2: Audio Volume Adjustment

Task: Multiply 880K audio samples (i16) by gain factor 0.8.

For the SIMD versions I used fixed-point here too: multiplying by 256 and shifting right by 8 keeps everything in the integer domain (the scalar baseline below stays in f32).

Scalar

pub fn adjust_volume(samples: &mut [i16], volume: f32) {
    for sample in samples.iter_mut() {
        let adjusted = (*sample as f32 * volume) as i32;
        *sample = adjusted.clamp(-32768, 32767) as i16;
    }
}

std::simd

pub fn adjust_volume(samples: &mut [i16], volume: f32) {
    let vol_fixed = (volume * 256.0) as i32;
    let vol_vec = i32x8::splat(vol_fixed);
    let simd_len = samples.len() - (samples.len() % 8);

    for i in (0..simd_len).step_by(8) {
        // No i16 → i32 widening instruction
        let mut vals = [0i32; 8];
        for j in 0..8 {
            vals[j] = samples[i + j] as i32;
        }
        let v = i32x8::from_array(vals);

        let adjusted = (v * vol_vec) >> Simd::splat(8);
        let clamped = adjusted.simd_clamp(i32x8::splat(-32768), i32x8::splat(32767));

        // No i32 → i16 narrowing instruction
        for (j, &val) in clamped.to_array().iter().enumerate() {
            samples[i + j] = val as i16;
        }
    }
}

NEON

pub fn adjust_volume(samples: &mut [i16], volume: f32) {
    let vol_fixed = (volume * 256.0) as i32;
    let simd_len = samples.len() - (samples.len() % 8);

    unsafe {
        let vol_vec = vdupq_n_s32(vol_fixed);

        for i in (0..simd_len).step_by(8) {
            let v = vld1q_s16(samples.as_ptr().add(i));

            // vmovl: widen i16 → i32
            let v_lo = vmovl_s16(vget_low_s16(v));
            let v_hi = vmovl_high_s16(v);

            let mul_lo = vmulq_s32(v_lo, vol_vec);
            let mul_hi = vmulq_s32(v_hi, vol_vec);

            let shifted_lo = vshrq_n_s32::<8>(mul_lo);
            let shifted_hi = vshrq_n_s32::<8>(mul_hi);

            // vqmovn: saturating narrow i32 → i16
            let result_lo = vqmovn_s32(shifted_lo);
            let result_hi = vqmovn_s32(shifted_hi);
            let result = vcombine_s16(result_lo, result_hi);

            vst1q_s16(samples.as_mut_ptr().add(i), result);
        }
    }
}

Results

| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 1.39ms | 1.0x |
| std::simd | 3.42ms | 0.40x |
| NEON | 0.82ms | 1.70x |

Analysis: the benchmarked std::simd version spends its time in scalar conversion loops. NEON's vmovl (widen) and vqmovn (saturating narrow) each take a single instruction; the portable API has no direct equivalent of the saturating narrow.
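
Worth noting: std::simd does provide an element-wise cast, which the compiler can often lower to widen/narrow instructions. A sketch of that variant (my assumption, untested against the benchmark):

```rust
use std::simd::prelude::*;

pub fn adjust_volume_cast(samples: &mut [i16], volume: f32) {
    let vol = i32x8::splat((volume * 256.0) as i32);
    let (lo, hi) = (i32x8::splat(-32768), i32x8::splat(32767));
    let simd_len = samples.len() - (samples.len() % 8);

    for i in (0..simd_len).step_by(8) {
        // Element-wise widen i16 -> i32, no scalar loop needed.
        let v: i32x8 = i16x8::from_slice(&samples[i..i + 8]).cast();
        // The narrow back is a plain truncating cast, so the saturation
        // still has to be an explicit clamp (NEON gets it free in vqmovn).
        let out: i16x8 = ((v * vol) >> i32x8::splat(8)).simd_clamp(lo, hi).cast();
        samples[i..i + 8].copy_from_slice(&out.to_array());
    }
}
```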


Scenario 3: Audio Mixing

Task: Mix two audio tracks using (a + b) / 2 to prevent overflow.

Division by 2 is the classic way to mix audio without clipping. Simple, but the intermediate sum can overflow i16, so the obvious implementation widens to i32 first.

Scalar

pub fn mix_tracks(track_a: &[i16], track_b: &[i16], output: &mut [i16]) {
    for ((a, b), out) in track_a.iter().zip(track_b.iter()).zip(output.iter_mut()) {
        *out = ((*a as i32 + *b as i32) / 2) as i16;
    }
}

std::simd

pub fn mix_tracks(track_a: &[i16], track_b: &[i16], output: &mut [i16]) {
    let simd_len = output.len() - (output.len() % 8);

    for i in (0..simd_len).step_by(8) {
        let mut a_vals = [0i32; 8];
        let mut b_vals = [0i32; 8];
        for j in 0..8 {
            a_vals[j] = track_a[i + j] as i32;
            b_vals[j] = track_b[i + j] as i32;
        }

        let a = i32x8::from_array(a_vals);
        let b = i32x8::from_array(b_vals);
        let mixed = (a + b) >> Simd::splat(1);

        for (j, &val) in mixed.to_array().iter().enumerate() {
            output[i + j] = val as i16;
        }
    }
}

NEON

pub fn mix_tracks(track_a: &[i16], track_b: &[i16], output: &mut [i16]) {
    let simd_len = output.len() - (output.len() % 8);

    unsafe {
        for i in (0..simd_len).step_by(8) {
            let a = vld1q_s16(track_a.as_ptr().add(i));
            let b = vld1q_s16(track_b.as_ptr().add(i));

            // vhaddq: halving add, computes (a + b) / 2 without overflow
            let mixed = vhaddq_s16(a, b);

            vst1q_s16(output.as_mut_ptr().add(i), mixed);
        }
    }
}

Results

| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 0.63ms | 1.0x |
| std::simd | 4.84ms | 0.13x |
| NEON | 0.53ms | 1.20x |

Analysis: vhaddq_s16 performs the halving add in a single instruction, designed for exactly this kind of audio mixing. The std::simd version must widen to i32, add, shift, and narrow back, with three scalar loops for the type conversions.
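
The halving add can also be written portably without any widening, using the carry-free average identity (a + b) >> 1 == (a & b) + ((a ^ b) >> 1), which never overflows i16. A sketch (my variant; whether LLVM lowers it to vhadd is not something I verified):

```rust
use std::simd::prelude::*;

pub fn mix_tracks_no_widen(track_a: &[i16], track_b: &[i16], output: &mut [i16]) {
    let simd_len = output.len() - (output.len() % 8);

    for i in (0..simd_len).step_by(8) {
        let a = i16x8::from_slice(&track_a[i..i + 8]);
        let b = i16x8::from_slice(&track_b[i..i + 8]);
        // Shared bits contribute fully, differing bits contribute half.
        let mixed = (a & b) + ((a ^ b) >> i16x8::splat(1));
        output[i..i + 8].copy_from_slice(&mixed.to_array());
    }
}
```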


Scenario 4: Counting Newlines

Task: Count occurrences of \n in 10MB text.

This is where SIMD really shines: contiguous data with a simple comparison. For the scalar version I used filter().count(), which Rust's iterators make clean.

Scalar

pub fn count_byte(data: &[u8], target: u8) -> usize {
    data.iter().filter(|&&b| b == target).count()
}

std::simd

pub fn count_byte(data: &[u8], target: u8) -> usize {
    let target_vec = u8x32::splat(target);
    let mut chunks = data.chunks_exact(32);
    let mut count = 0usize;

    // by_ref() so the iterator can still be asked for its remainder below
    for chunk in chunks.by_ref() {
        let v = u8x32::from_slice(chunk);
        let mask = v.simd_eq(target_vec);
        count += mask.to_bitmask().count_ones() as usize;
    }

    count += chunks.remainder().iter().filter(|&&b| b == target).count();
    count
}

NEON

pub fn count_byte(data: &[u8], target: u8) -> usize {
    let simd_len = data.len() - (data.len() % 16);
    let mut count = 0usize;

    unsafe {
        let target_vec = vdupq_n_u8(target);

        for i in (0..simd_len).step_by(16) {
            let v = vld1q_u8(data.as_ptr().add(i));
            let eq = vceqq_u8(v, target_vec);
            // 0xFF -> 0x01 per matching lane
            let ones = vshrq_n_u8::<7>(eq);
            count += vaddvq_u8(ones) as usize;
        }
    }

    // Scalar tail for the last len % 16 bytes
    count += data[simd_len..].iter().filter(|&&b| b == target).count();
    count
}

Results

| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 14.10ms | 1.0x |
| std::simd | 3.22ms | 4.38x |
| NEON | 3.87ms | 3.65x |

Analysis: contiguous data, a simple comparison, no type conversions. Both SIMD approaches do well. The std::simd version processes 32 bytes per iteration (u8x32 is lowered to a pair of 128-bit NEON ops), versus 16 for the hand-written loop, which likely explains its slight edge.
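
A common NEON refinement here (my sketch, not part of the benchmark) is to accumulate per-lane counts in a vector register and reduce only once per batch, removing the horizontal add from the hot loop:

```rust
use std::arch::aarch64::*;

pub fn count_byte_batched(data: &[u8], target: u8) -> usize {
    let simd_len = data.len() - (data.len() % 16);
    let mut count = 0usize;

    unsafe {
        let target_vec = vdupq_n_u8(target);
        let mut i = 0;
        while i < simd_len {
            // Each u8 lane can count at most 255 matches, so reduce at
            // least once every 255 blocks.
            let block_end = usize::min(i + 255 * 16, simd_len);
            let mut lane_counts = vdupq_n_u8(0);
            while i < block_end {
                let eq = vceqq_u8(vld1q_u8(data.as_ptr().add(i)), target_vec);
                // 0x00 - 0xFF wraps to 0x01: subtracting the mask adds 1 per match.
                lane_counts = vsubq_u8(lane_counts, eq);
                i += 16;
            }
            // vaddlvq_u8: widening horizontal sum (16 x u8 -> u16).
            count += vaddlvq_u8(lane_counts) as usize;
        }
    }

    count + data[simd_len..].iter().filter(|&&b| b == target).count()
}
```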


Scenario 5: Finding a Byte

Task: Find position of @ in 10MB data (target near end).

I placed the target byte near the end to simulate worst-case search. The position() iterator is nice for scalar.

Scalar

pub fn find_byte(data: &[u8], target: u8) -> Option<usize> {
    data.iter().position(|&b| b == target)
}

std::simd

pub fn find_byte(data: &[u8], target: u8) -> Option<usize> {
    let target_vec = u8x32::splat(target);
    let chunks = data.chunks_exact(32);

    for (chunk_idx, chunk) in chunks.enumerate() {
        let v = u8x32::from_slice(chunk);
        let mask = v.simd_eq(target_vec);
        let bitmask = mask.to_bitmask();
        if bitmask != 0 {
            return Some(chunk_idx * 32 + bitmask.trailing_zeros() as usize);
        }
    }

    // Scalar tail for the last len % 32 bytes
    let tail_start = data.len() - (data.len() % 32);
    data[tail_start..]
        .iter()
        .position(|&b| b == target)
        .map(|p| tail_start + p)
}

NEON

pub fn find_byte(data: &[u8], target: u8) -> Option<usize> {
    let len = data.len();
    let simd_len = len - (len % 16);

    unsafe {
        let target_vec = vdupq_n_u8(target);

        for i in (0..simd_len).step_by(16) {
            let v = vld1q_u8(data.as_ptr().add(i));
            let eq = vceqq_u8(v, target_vec);
            // vmaxvq_u8: if any 0xFF exists, returns 0xFF
            if vmaxvq_u8(eq) != 0 {
                // Found match, linear search for exact position
                for j in 0..16 {
                    if data[i + j] == target {
                        return Some(i + j);
                    }
                }
            }
        }
    }

    // Scalar tail for the last len % 16 bytes
    data[simd_len..]
        .iter()
        .position(|&b| b == target)
        .map(|p| simd_len + p)
}

Results

| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 24.07ms | 1.0x |
| std::simd | 2.61ms | 9.23x |
| NEON | 2.68ms | 9.00x |

Analysis: the best-case SIMD scenario: compare a full vector per iteration (32 bytes in the std::simd version, 16 in the NEON one) and exit early on the first match.
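
The inner scalar search in the NEON version can also be avoided with the well-known "shrn trick": compress the 128-bit comparison mask into a 64-bit integer (4 bits per byte) and take trailing_zeros. A sketch (my variant, not the benchmarked code):

```rust
use std::arch::aarch64::*;

pub fn find_byte_shrn(data: &[u8], target: u8) -> Option<usize> {
    let simd_len = data.len() - (data.len() % 16);

    unsafe {
        let target_vec = vdupq_n_u8(target);
        for i in (0..simd_len).step_by(16) {
            let eq = vceqq_u8(vld1q_u8(data.as_ptr().add(i)), target_vec);
            // Narrow the mask to 64 bits: shift each u16 lane right by 4 and
            // keep the low byte, leaving 4 mask bits per input byte.
            let nibbles = vshrn_n_u16::<4>(vreinterpretq_u16_u8(eq));
            let mask = vget_lane_u64::<0>(vreinterpret_u64_u8(nibbles));
            if mask != 0 {
                return Some(i + (mask.trailing_zeros() / 4) as usize);
            }
        }
    }

    data[simd_len..]
        .iter()
        .position(|&b| b == target)
        .map(|p| simd_len + p)
}
```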


Scenario 6: Dot Product

Task: Dot product of two 10M-element f32 vectors.

Classic numerical computing workload. Rust's iterator chain makes the scalar version readable.

Scalar

pub fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

std::simd

pub fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    let chunks_a = a.chunks_exact(8);
    let chunks_b = b.chunks_exact(8);

    let mut acc = f32x8::splat(0.0);

    for (a_chunk, b_chunk) in chunks_a.zip(chunks_b) {
        let va = f32x8::from_slice(a_chunk);
        let vb = f32x8::from_slice(b_chunk);
        acc += va * vb;
    }

    // Scalar tail for the last len % 8 elements
    let tail = a.len() - (a.len() % 8);
    acc.reduce_sum() + a[tail..].iter().zip(&b[tail..]).map(|(x, y)| x * y).sum::<f32>()
}

NEON

pub fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    let simd_len = a.len() - (a.len() % 4);

    unsafe {
        let mut acc = vdupq_n_f32(0.0);

        for i in (0..simd_len).step_by(4) {
            let va = vld1q_f32(a.as_ptr().add(i));
            let vb = vld1q_f32(b.as_ptr().add(i));
            acc = vfmaq_f32(acc, va, vb);
        }

        // Scalar tail for the last len % 4 elements
        vaddvq_f32(acc) + a[simd_len..].iter().zip(&b[simd_len..]).map(|(x, y)| x * y).sum::<f32>()
    }
}

Results

| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 51.20ms | 1.0x |
| std::simd | 10.99ms | 4.66x |
| NEON | 20.14ms | 2.54x |

Analysis: std::simd outperforms hand-written NEON here. The f32x8 accumulator is lowered to two independent 128-bit accumulators, halving the loop-carried FMA dependency chain, while the NEON version funnels every vfmaq_f32 through a single register and is latency-bound.
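
If that explanation is right, unrolling the NEON loop over two independent accumulators should close most of the gap. A hypothetical sketch (my assumption, untested):

```rust
use std::arch::aarch64::*;

pub fn dot_product_2acc(a: &[f32], b: &[f32]) -> f32 {
    let simd_len = a.len() - (a.len() % 8);

    unsafe {
        // Two independent FMA chains: the second vfmaq no longer waits on
        // the first one's result.
        let mut acc0 = vdupq_n_f32(0.0);
        let mut acc1 = vdupq_n_f32(0.0);
        for i in (0..simd_len).step_by(8) {
            acc0 = vfmaq_f32(acc0, vld1q_f32(a.as_ptr().add(i)), vld1q_f32(b.as_ptr().add(i)));
            acc1 = vfmaq_f32(acc1, vld1q_f32(a.as_ptr().add(i + 4)), vld1q_f32(b.as_ptr().add(i + 4)));
        }
        vaddvq_f32(vaddq_f32(acc0, acc1))
            + a[simd_len..].iter().zip(&b[simd_len..]).map(|(x, y)| x * y).sum::<f32>()
    }
}
```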


Scenario 7: Matrix-Vector Multiplication

Task: 1024×1024 matrix times 1024-element vector.

I reused the dot product implementation row by row—keeps the code simple.

Scalar

pub fn matrix_vector_mul(matrix: &[f32], vector: &[f32], result: &mut [f32], rows: usize, cols: usize) {
    for i in 0..rows {
        let row_start = i * cols;
        result[i] = matrix[row_start..row_start + cols]
            .iter()
            .zip(vector.iter())
            .map(|(m, v)| m * v)
            .sum();
    }
}

std::simd

pub fn matrix_vector_mul(matrix: &[f32], vector: &[f32], result: &mut [f32], rows: usize, cols: usize) {
    for i in 0..rows {
        let row = &matrix[i * cols..(i + 1) * cols];
        result[i] = dot_product(row, vector);  // reuse SIMD dot product
    }
}

NEON

pub fn matrix_vector_mul(matrix: &[f32], vector: &[f32], result: &mut [f32], rows: usize, cols: usize) {
    for i in 0..rows {
        let row = &matrix[i * cols..(i + 1) * cols];
        result[i] = dot_product(row, vector);  // reuse NEON dot product
    }
}

Results

| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 4.48ms | 1.0x |
| std::simd | 0.69ms | 6.53x |
| NEON | 1.35ms | 3.32x |

Analysis: Regular memory access pattern, same characteristics as dot product.


Scenario 8: Range Check

Task: Verify all 10M i32 values are in [0, 100).

Early exit on failure makes this fast when data is invalid. I kept all values in range to measure worst-case (full scan).

Scalar

pub fn all_in_range(data: &[i32], min: i32, max: i32) -> bool {
    data.iter().all(|&x| x >= min && x <= max)
}

std::simd

pub fn all_in_range(data: &[i32], min: i32, max: i32) -> bool {
    let min_vec = i32x8::splat(min);
    let max_vec = i32x8::splat(max);

    for chunk in data.chunks_exact(8) {
        let v = i32x8::from_slice(chunk);
        let ge_min = v.simd_ge(min_vec);
        let le_max = v.simd_le(max_vec);
        if !(ge_min & le_max).all() {
            return false;
        }
    }

    // Scalar tail for the last len % 8 elements
    data[data.len() - (data.len() % 8)..].iter().all(|&x| x >= min && x <= max)
}

NEON

pub fn all_in_range(data: &[i32], min: i32, max: i32) -> bool {
    let len = data.len();
    let simd_len = len - (len % 4);

    unsafe {
        let min_vec = vdupq_n_s32(min);
        let max_vec = vdupq_n_s32(max);

        for i in (0..simd_len).step_by(4) {
            let v = vld1q_s32(data.as_ptr().add(i));
            let ge_min = vcgeq_s32(v, min_vec);
            let le_max = vcleq_s32(v, max_vec);
            let both = vandq_u32(ge_min, le_max);
            // vminvq: if any 0 exists, returns 0
            if vminvq_u32(both) == 0 {
                return false;
            }
        }
    }

    // Scalar tail for the last len % 4 elements
    data[simd_len..].iter().all(|&x| x >= min && x <= max)
}

Results

| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 24.93ms | 1.0x |
| std::simd | 8.24ms | 3.02x |
| NEON | 9.33ms | 2.67x |

Analysis: Simple comparisons on contiguous i32 data. No type conversions needed.


Scenario 9: Sorted Check

Task: Verify 10M i32 values are sorted ascending.

I used pre-sorted data to measure full-scan performance. The windows() iterator is convenient but has overhead.

Scalar

pub fn is_sorted(data: &[i32]) -> bool {
    data.windows(2).all(|w| w[0] <= w[1])
}

std::simd

pub fn is_sorted(data: &[i32]) -> bool {
    // windows(9) yields nothing for slices shorter than 9 elements
    if data.len() < 9 {
        return data.windows(2).all(|w| w[0] <= w[1]);
    }
    for window in data.windows(9) {
        let current = i32x8::from_slice(&window[0..8]);
        let next = i32x8::from_slice(&window[1..9]);
        if !current.simd_le(next).all() {
            return false;
        }
    }
    true
}

NEON

pub fn is_sorted(data: &[i32]) -> bool {
    let mut i = 0;

    unsafe {
        while i + 4 < data.len() {
            let current = vld1q_s32(data.as_ptr().add(i));
            let next = vld1q_s32(data.as_ptr().add(i + 1));
            let le = vcleq_s32(current, next);
            if vminvq_u32(le) == 0 {
                return false;
            }
            i += 4;
        }
    }

    // Scalar check for the remaining tail pairs
    data[i..].windows(2).all(|w| w[0] <= w[1])
}

Results

| Implementation | Time | vs Scalar |
|---|---|---|
| Scalar | 25.14ms | 1.0x |
| std::simd | 37.18ms | 0.68x |
| NEON | 6.87ms | 3.66x |

Analysis: windows(9) advances one element at a time, so every adjacent pair is compared eight times and each step performs two overlapping loads, plus iterator bookkeeping. The NEON version strides four elements per iteration with direct pointer arithmetic.
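
The fix needs no intrinsics, just a different stride: advance eight lanes per iteration so each adjacent pair is compared exactly once. A sketch (my variant, not benchmarked):

```rust
use std::simd::prelude::*;

pub fn is_sorted_stride(data: &[i32]) -> bool {
    let mut i = 0;
    // Compare lanes [i..i+8] with [i+1..i+9], then jump 8 lanes ahead:
    // every adjacent pair is checked exactly once.
    while i + 9 <= data.len() {
        let current = i32x8::from_slice(&data[i..i + 8]);
        let next = i32x8::from_slice(&data[i + 1..i + 9]);
        if !current.simd_le(next).all() {
            return false;
        }
        i += 8;
    }
    // Scalar check for the remaining tail pairs.
    data[i..].windows(2).all(|w| w[0] <= w[1])
}
```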


Root Cause Analysis

After digging into the assembly, the pattern became clear. std::simd exposes only operations available across all target platforms. ARM-specific instructions cannot be represented:

| Operation | std::simd | NEON |
|---|---|---|
| Deinterleave load | scalar loop | vld3q_u8 |
| Widen i16→i32 | scalar loop | vmovl_s16 |
| Saturating narrow i32→i16 | manual clamp | vqmovn_s32 |
| Halving add | add + shift + narrow | vhaddq_s16 |

When these operations are needed, std::simd falls back to scalar loops, negating SIMD benefits and adding overhead.


NEON Instruction Reference

Naming pattern:

vld3q_u8
│││││└─ u8: data type
││││└── q: 128-bit register
│││└─── 3: 3-way deinterleave
│└───── ld: load
└────── v: vector

| Instruction | Operation |
|---|---|
| vld1q_u8 | Load 16 bytes |
| vld3q_u8 | Load + deinterleave RGB |
| vmull_u8 | Widening multiply u8→u16 |
| vmovl_s16 | Widen i16→i32 |
| vqmovn_s32 | Saturating narrow i32→i16 |
| vfmaq_f32 | Fused multiply-add |
| vhaddq_s16 | Halving add |
| vaddvq_f32 | Horizontal sum |

Recommendations

| Data Pattern | Approach |
|---|---|
| Contiguous f32/i32 arrays | std::simd |
| Interleaved data (RGB, stereo audio) | Platform intrinsics |
| Operations requiring type conversion | Platform intrinsics |
| Cross-platform library | std::simd + scalar fallback |
| Maximum ARM performance | NEON intrinsics |

Conclusion

My takeaway: std::simd is great for numerical workloads on contiguous data—often better than hand-written intrinsics because the compiler knows optimization tricks I don't.

But for image and audio processing, std::simd falls short. The portable abstraction cannot express instructions like vld3q_u8 or vhaddq_s16 that ARM provides specifically for these workloads.

If we're targeting ARM and working with interleaved data, NEON intrinsics remain the way to go.



Source

Full source code: github.com/Erio-Harrison/simd_benchmark
