## Introduction

In this article, we explore how to write SIMD instructions—SSE for x86_64 and NEON for AArch64—using inline assembly in the Crystal programming language.
Crystal uses LLVM as its backend, but it doesn't yet take full advantage of SIMD in the code it generates.
This is not a performance tuning guide, but rather a fun exploration of low-level programming with Crystal.
## `asm` Syntax

Crystal provides the `asm` keyword for writing inline assembly. The syntax is based on LLVM's integrated assembler.

```crystal
asm("template" : outputs : inputs : clobbers : flags)
```
Each section:

- `template`: LLVM-style assembly code
- `outputs`: Output operands
- `inputs`: Input operands
- `clobbers`: Registers that will be modified
- `flags`: Optional flags (e.g., `"volatile"`)

For a detailed explanation, see the official docs.
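Here is a minimal warm-up example on x86_64 before we get to SIMD: an immediate is moved into a register chosen by LLVM, which is then bound to the output operand `dst`. Note that `$$` escapes a literal `$` in the template.

```crystal
dst = uninitialized Int32
asm("mov $$1234, $0" : "=r"(dst)) # "=r": let LLVM pick a register, write it back to dst
puts dst # => 1234
```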
## Types of SIMD Instructions

- SSE / AVX for Intel and AMD CPUs (x86_64)
- NEON for ARM CPUs (like Apple Silicon)
## Types of Registers

### Registers Used in x86_64

- General-purpose: `rax`, `rbx`, `rcx`, `rdx`, `rsi`, `rdi`, `rsp`, `rbp`, `r8`–`r15`
- SIMD:

| Name | Width | Instruction Set | Usage |
|---|---|---|---|
| `xmm0`–`xmm15` | 128-bit | SSE | Floats, ints |
| `ymm0`–`ymm15` | 256-bit | AVX | Wider SIMD |
| `zmm0`–`zmm31` | 512-bit | AVX-512 | Used in newer CPUs |
### Registers Used in AArch64 (NEON)

- Vector registers: `v0`–`v31`
- `v0.4s` = 4 × 32-bit floats
- `v1.8h` = 8 × 16-bit half-precision floats
### Examples of Register Specification

- SSE: `xmm0`, `xmm1`, etc.
- NEON: `v0.4s`, `v1.8h`, etc.

Note:

- LLVM assigns SSE registers automatically
- NEON requires explicit register naming in inline assembly
## Prerequisites

To follow along:

- Emit LLVM IR: `crystal build --emit llvm-ir foo.cr`
- Emit assembly: `crystal build --emit asm foo.cr`
- Benchmarking tool: `hyperfine` (e.g. `hyperfine ./foo`)
- Use of `uninitialized` and `to_unsafe` for low-level memory access (see the sketch below)
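Since `uninitialized` and `to_unsafe` appear in every snippet that follows, here is a quick sketch of what they do:

```crystal
# `uninitialized` reserves stack space without zero-filling it;
# `to_unsafe` returns a raw Pointer(Float32) to the array's buffer.
buf = uninitialized StaticArray(Float32, 4)
ptr = buf.to_unsafe
4.times { |i| ptr[i] = (i + 1).to_f32 } # write lanes through the raw pointer
puts buf # => StaticArray[1.0, 2.0, 3.0, 4.0]
```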
## Basic Vector Operations

### Vector Addition

#### SSE (x86_64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]
b = StaticArray[5.0_f32, 6.0_f32, 7.0_f32, 8.0_f32]

def simd_vector_add(a : StaticArray(Float32, 4), b : StaticArray(Float32, 4)) : StaticArray(Float32, 4)
  result = uninitialized StaticArray(Float32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "movups ($1), %xmm0  // load vector a into xmm0
     movups ($2), %xmm1  // load vector b into xmm1
     addps %xmm1, %xmm0  // perform parallel addition of four 32-bit floats
     movups %xmm0, ($0)  // store result to memory"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "xmm0", "xmm1", "memory"
        : "volatile"
  )

  result
end

puts "Vector addition: #{simd_vector_add(a, b)}"
```
#### NEON (AArch64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]
b = StaticArray[5.0_f32, 6.0_f32, 7.0_f32, 8.0_f32]

def simd_vector_add(a : StaticArray(Float32, 4), b : StaticArray(Float32, 4)) : StaticArray(Float32, 4)
  result = uninitialized StaticArray(Float32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "ld1 {v0.4s}, [$1]        // load vector a
     ld1 {v1.4s}, [$2]        // load vector b
     fadd v2.4s, v0.4s, v1.4s // add each element
     st1 {v2.4s}, [$0]        // store the result"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "v0", "v1", "v2", "memory"
        : "volatile"
  )

  result
end

puts "Vector addition: #{simd_vector_add(a, b)}"
```
### Vector Multiplication

#### SSE (x86_64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]
b = StaticArray[5.0_f32, 6.0_f32, 7.0_f32, 8.0_f32]

def simd_vector_multiply(a : StaticArray(Float32, 4), b : StaticArray(Float32, 4)) : StaticArray(Float32, 4)
  result = uninitialized StaticArray(Float32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "movups ($1), %xmm0  // load vector a into xmm0
     movups ($2), %xmm1  // load vector b into xmm1
     mulps %xmm1, %xmm0  // perform parallel multiplication of four 32-bit floats
     movups %xmm0, ($0)  // store result to memory"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "xmm0", "xmm1", "memory"
        : "volatile"
  )

  result
end

puts "Vector multiplication: #{simd_vector_multiply(a, b)}"
```
#### NEON (AArch64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]
b = StaticArray[5.0_f32, 6.0_f32, 7.0_f32, 8.0_f32]

def simd_vector_multiply(a : StaticArray(Float32, 4), b : StaticArray(Float32, 4)) : StaticArray(Float32, 4)
  result = uninitialized StaticArray(Float32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "ld1 {v0.4s}, [$1]        // load vector a
     ld1 {v1.4s}, [$2]        // load vector b
     fmul v2.4s, v0.4s, v1.4s // multiply each element
     st1 {v2.4s}, [$0]        // store the result"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "v0", "v1", "v2", "memory"
        : "volatile"
  )

  result
end

puts "Vector multiplication: #{simd_vector_multiply(a, b)}"
```
## Aggregation Operations

### Vector Sum

#### SSE (x86_64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]

def simd_vector_sum(vec : StaticArray(Float32, 4)) : Float32
  result = uninitialized Float32
  vec_ptr = vec.to_unsafe
  result_ptr = pointerof(result)

  asm(
    "movups ($1), %xmm0  // load vector into xmm0
     haddps %xmm0, %xmm0 // horizontal add: [a+b, c+d, a+b, c+d]
     haddps %xmm0, %xmm0 // horizontal add again: [a+b+c+d, *, *, *]
     movss %xmm0, ($0)   // store the first element of result"
        :: "r"(result_ptr), "r"(vec_ptr)
        : "xmm0", "memory"
        : "volatile"
  )

  result
end

puts "Vector sum: #{simd_vector_sum(a)}"
```
#### NEON (AArch64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]

def simd_vector_sum(vec : StaticArray(Float32, 4)) : Float32
  result = uninitialized Float32
  vec_ptr = vec.to_unsafe
  result_ptr = pointerof(result)

  asm(
    "ld1 {v0.4s}, [$1]         // load vector
     faddp v1.4s, v0.4s, v0.4s // pairwise add: [a+b, c+d, a+b, c+d]
     faddp v2.2s, v1.2s, v1.2s // pairwise add again: [a+b+c+d, *]
     str s2, [$0]              // store the final sum"
        :: "r"(result_ptr), "r"(vec_ptr)
        : "v0", "v1", "v2", "memory"
        : "volatile"
  )

  result
end

puts "Vector sum: #{simd_vector_sum(a)}"
```
### Finding Maximum Value

#### SSE (x86_64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]

def simd_vector_max(vec : StaticArray(Float32, 4)) : Float32
  result = uninitialized Float32
  vec_ptr = vec.to_unsafe
  result_ptr = pointerof(result)

  asm(
    "movups ($1), %xmm0          // load vector into xmm0
     movaps %xmm0, %xmm1         // copy xmm0 to xmm1
     shufps $$0x4E, %xmm1, %xmm1 // swap upper and lower pairs
     maxps %xmm1, %xmm0          // compute max of each pair
     movaps %xmm0, %xmm1         // copy result to xmm1
     shufps $$0x01, %xmm1, %xmm1 // shuffle adjacent elements
     maxps %xmm1, %xmm0          // compute final max
     movss %xmm0, ($0)           // store the result"
        :: "r"(result_ptr), "r"(vec_ptr)
        : "xmm0", "xmm1", "memory"
        : "volatile"
  )

  result
end

puts "Vector max: #{simd_vector_max(a)}"
```
#### NEON (AArch64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]

def simd_vector_max(vec : StaticArray(Float32, 4)) : Float32
  result = uninitialized Float32
  vec_ptr = vec.to_unsafe
  result_ptr = pointerof(result)

  asm(
    "ld1 {v0.4s}, [$1]         // load vector
     fmaxp v1.4s, v0.4s, v0.4s // pairwise max: [max(a, b), max(c, d), ...]
     fmaxp v2.2s, v1.2s, v1.2s // final pairwise max
     str s2, [$0]              // store result"
        :: "r"(result_ptr), "r"(vec_ptr)
        : "v0", "v1", "v2", "memory"
        : "volatile"
  )

  result
end

puts "Vector max: #{simd_vector_max(a)}"
```
## Integer Operations

### Integer Addition

#### SSE (x86_64)

```crystal
int_a = StaticArray[1, 2, 3, 4]
int_b = StaticArray[10, 20, 30, 40]

def simd_int_add(a : StaticArray(Int32, 4), b : StaticArray(Int32, 4)) : StaticArray(Int32, 4)
  result = uninitialized StaticArray(Int32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "movdqu ($1), %xmm0  // load integer vector a into xmm0
     movdqu ($2), %xmm1  // load integer vector b into xmm1
     paddd %xmm1, %xmm0  // perform parallel addition of four 32-bit integers
     movdqu %xmm0, ($0)  // store result to memory"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "xmm0", "xmm1", "memory"
        : "volatile"
  )

  result
end

puts "Integer addition: #{simd_int_add(int_a, int_b)}"
```
#### NEON (AArch64)

```crystal
int_a = StaticArray[1, 2, 3, 4]
int_b = StaticArray[10, 20, 30, 40]

def simd_int_add(a : StaticArray(Int32, 4), b : StaticArray(Int32, 4)) : StaticArray(Int32, 4)
  result = uninitialized StaticArray(Int32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "ld1 {v0.4s}, [$1]       // load integer vector a
     ld1 {v1.4s}, [$2]       // load integer vector b
     add v2.4s, v0.4s, v1.4s // perform element-wise addition
     st1 {v2.4s}, [$0]       // store result to memory"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "v0", "v1", "v2", "memory"
        : "volatile"
  )

  result
end

puts "Integer addition: #{simd_int_add(int_a, int_b)}"
```
### Saturated Addition

#### SSE (x86_64)

```crystal
sat_a = StaticArray[29_000_i16, 30_000_i16, 31_000_i16, 32_000_i16,
                    32_000_i16, 32_000_i16, 32_000_i16, 32_000_i16]
sat_b = StaticArray[1_000_i16, 1_000_i16, 1_000_i16, 1_000_i16,
                    500_i16, 600_i16, 700_i16, 800_i16]

def simd_saturated_add(a : StaticArray(Int16, 8), b : StaticArray(Int16, 8)) : StaticArray(Int16, 8)
  result = uninitialized StaticArray(Int16, 8)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "movdqu ($1), %xmm0  // load 8 × 16-bit integers into xmm0
     movdqu ($2), %xmm1  // load 8 × 16-bit integers into xmm1
     paddsw %xmm1, %xmm0 // perform saturated addition
     movdqu %xmm0, ($0)  // store result to memory"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "xmm0", "xmm1", "memory"
        : "volatile"
  )

  result
end

puts "Saturated addition: #{simd_saturated_add(sat_a, sat_b)}"
```
#### NEON (AArch64)

```crystal
sat_a = StaticArray[29_000_i16, 30_000_i16, 31_000_i16, 32_000_i16,
                    32_000_i16, 32_000_i16, 32_000_i16, 32_000_i16]
sat_b = StaticArray[1_000_i16, 1_000_i16, 1_000_i16, 1_000_i16,
                    500_i16, 600_i16, 700_i16, 800_i16]

def simd_saturated_add(a : StaticArray(Int16, 8), b : StaticArray(Int16, 8)) : StaticArray(Int16, 8)
  result = uninitialized StaticArray(Int16, 8)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "ld1 {v0.8h}, [$1]         // load 8 × 16-bit integers from a into v0
     ld1 {v1.8h}, [$2]         // load 8 × 16-bit integers from b into v1
     sqadd v2.8h, v0.8h, v1.8h // perform saturated addition
     st1 {v2.8h}, [$0]         // store result to memory"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "v0", "v1", "v2", "memory"
        : "volatile"
  )

  result
end

puts "Saturated addition: #{simd_saturated_add(sat_a, sat_b)}"
```
## Examining LLVM-IR and Assembly

To inspect LLVM IR output:

```
crystal build your_file.cr --emit llvm-ir --no-debug
```

To inspect raw assembly:

```
crystal build your_file.cr --emit asm --no-debug
```

You'll see that your inline `asm` blocks are preserved as-is, even with optimizations (`-O3`).
For example, here is the NEON vector addition in the optimized IR; the `asm sideeffect` call survives untouched:

```llvm
__crystal_once.exit.i.i:                          ; preds = %else.i.i.i, %.noexc98
  call void @llvm.lifetime.start.p0(i64 16, ptr nonnull %path.i.i.i.i.i)
  call void @llvm.lifetime.start.p0(i64 16, ptr nonnull %obj1.i.i.i.i)
  call void @llvm.lifetime.start.p0(i64 16, ptr nonnull %b2.i.i.i)
  store <4 x float> <float 1.000000e+00, float 2.000000e+00, float 3.000000e+00, float 4.000000e+00>, ptr %obj1.i.i.i.i, align 16
  store <4 x float> <float 5.000000e+00, float 6.000000e+00, float 7.000000e+00, float 8.000000e+00>, ptr %b2.i.i.i, align 16
  call void asm sideeffect "ld1 {v0.4s}, [$1] \0Ald1 {v1.4s}, [$2] \0Afadd v2.4s, v0.4s, v1.4s \0Ast1 {v2.4s}, [$0]", "r,r,r,~{v0},~{v1},~{v2},~{memory}"(ptr nonnull %path.i.i.i.i.i, ptr nonnull %obj1.i.i.i.i, ptr nonnull %b2.i.i.i) #30
  %314 = load <4 x float>, ptr %path.i.i.i.i.i, align 16
  call void @llvm.lifetime.end.p0(i64 16, ptr nonnull %path.i.i.i.i.i)
  call void @llvm.lifetime.end.p0(i64 16, ptr nonnull %obj1.i.i.i.i)
  call void @llvm.lifetime.end.p0(i64 16, ptr nonnull %b2.i.i.i)
  %315 = invoke ptr @GC_malloc(i64 80)
          to label %.noexc100 unwind label %rescue2.loopexit.split-lp.loopexit.split-lp.loopexit.split-lp
```
And in the emitted AArch64 assembly, the inline block appears verbatim between the `InlineAsm` markers:

```asm
Lloh2300:
  ldr q1, [x9, lCPI312_43@PAGEOFF]
  add x8, sp, #164
  add x9, sp, #128
  str q0, [sp, #128]
  stur q1, [x29, #-128]
  ; InlineAsm Start
  ld1.4s { v0 }, [x9]
  ld1.4s { v1 }, [x10]
  fadd.4s v2, v0, v1
  st1.4s { v2 }, [x8]
  ; InlineAsm End
  ldr q0, [x25]
  str q0, [sp, #16]
```
## Miscellaneous

When using SIMD together with parallelism, memory bandwidth can become the bottleneck.
Although Crystal currently runs single-threaded by default, true parallelism is in progress, and memory bandwidth limits may become relevant in the future.
## Conclusion

We've explored how to write SIMD operations in Crystal using inline `asm`, and examined how those instructions are lowered into LLVM IR and eventually into assembly.
This was a deep dive into low-level Crystal.
## Appendix: SIMD Instruction Reference

### SSE (x86_64)

| Instruction | Description |
|---|---|
| `movups` | Load/store 4 × Float32 (unaligned) |
| `movaps` | Load/store 4 × Float32 (aligned) |
| `movdqu` | Load/store 4 × Int32 or 8 × Int16 (unaligned) |
| `movss` | Move scalar Float32 (lowest lane) |
| `addps` | Add 4 × Float32 |
| `mulps` | Multiply 4 × Float32 |
| `paddd` | Add 4 × Int32 |
| `paddsw` | Saturated add 8 × Int16 |
| `haddps` | Horizontal add of Float32 pairs |
| `maxps` | Element-wise max (Float32) |
| `shufps` | Shuffle Float32 lanes (for reduction) |
### NEON (AArch64)

| Instruction | Description |
|---|---|
| `ld1` | Load vector (e.g. `v0.4s`, `v0.8h`) |
| `st1` | Store vector |
| `add` | Add 4 × Int32 |
| `sqadd` | Saturated add 8 × Int16 |
| `fadd` | Add 4 × Float32 |
| `fmul` | Multiply 4 × Float32 |
| `faddp` | Pairwise add (Float32 reduction) |
| `fmaxp` | Pairwise max (Float32 reduction) |
| `faddv` | Vector-wide add (SVE only; base NEON reduces with repeated `faddp`) |
| `fmaxv` | Vector-wide max (single-instruction alternative to `fmaxp`) |
### Notes

- SSE's `movaps` and `movdqa` require 16-byte alignment.
- NEON's `faddp` and `fmaxp` reduce in two steps: 4 → 2 → 1.
- `shufps` takes an immediate mask of four 2-bit lane selectors: `0x4E` (`0b01_00_11_10`) swaps the upper and lower 64-bit halves, and `0x01` (`0b00_00_00_01`) moves lane 1 into lane 0, which is how the max reduction reorders lanes.
- Saturated arithmetic (`paddsw`, `sqadd`) clamps values on overflow instead of wrapping.
Thanks for reading — and happy crystaling! 💎