## Introduction

In this article, we explore how to write SIMD instructions—SSE for x86_64 and NEON for AArch64—using inline assembly in the Crystal programming language.
Crystal uses LLVM as its backend, but it doesn't yet take full advantage of SIMD in the code it generates.
This is not a performance tuning guide, but rather a fun exploration of low-level programming with Crystal.
## `asm` Syntax

Crystal provides the `asm` keyword for writing inline assembly. The syntax is based on LLVM's integrated assembler.

```crystal
asm("template" : outputs : inputs : clobbers : flags)
```
Each section:

- `template`: LLVM-style assembly code
- `outputs`: Output operands
- `inputs`: Input operands
- `clobbers`: Registers that will be modified
- `flags`: Optional flags (e.g., `"volatile"`)

For a detailed explanation, see the official docs.
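Here is a minimal warm-up example on x86_64 before we get to SIMD: an immediate is moved into a register chosen by LLVM, which is then bound to the output operand `dst`. Note that `$$` escapes a literal `$` in the template.

```crystal
dst = uninitialized Int32
asm("mov $$1234, $0" : "=r"(dst)) # "=r": let LLVM pick a register, write it back to dst
puts dst # => 1234
```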
## Types of SIMD Instructions

- SSE / AVX for Intel and AMD CPUs (x86_64)
- NEON for ARM CPUs (like Apple Silicon)
## Types of Registers

### Registers Used in x86_64

- General-purpose: `rax`, `rbx`, `rcx`, `rdx`, `rsi`, `rdi`, `rsp`, `rbp`, `r8`–`r15`
- SIMD:

| Name | Width | Instruction Set | Usage |
|---|---|---|---|
| `xmm0`–`xmm15` | 128-bit | SSE | Floats, ints |
| `ymm0`–`ymm15` | 256-bit | AVX | Wider SIMD |
| `zmm0`–`zmm31` | 512-bit | AVX-512 | Used in newer CPUs |
### Registers Used in AArch64 (NEON)

- Vector registers: `v0`–`v31`
- `v0.4s` = 4 × 32-bit floats
- `v1.8h` = 8 × 16-bit half-precision floats
### Examples of Register Specification

- SSE: `xmm0`, `xmm1`, etc.
- NEON: `v0.4s`, `v1.8h`, etc.

Note:

- LLVM assigns SSE registers automatically
- NEON requires explicit register naming in inline assembly
## Prerequisites

To follow along:

- Emit LLVM IR: `crystal build --emit llvm-ir foo.cr`
- Emit assembly: `crystal build --emit asm foo.cr`
- Benchmarking tool: `hyperfine` (e.g. `hyperfine ./foo`)
- Use of `uninitialized` and `to_unsafe` for low-level memory access (see the sketch below)
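Since `uninitialized` and `to_unsafe` appear in every snippet that follows, here is a quick sketch of what they do:

```crystal
# `uninitialized` reserves stack space without zero-filling it;
# `to_unsafe` returns a raw Pointer(Float32) to the array's buffer.
buf = uninitialized StaticArray(Float32, 4)
ptr = buf.to_unsafe
4.times { |i| ptr[i] = (i + 1).to_f32 } # write lanes through the raw pointer
puts buf # => StaticArray[1.0, 2.0, 3.0, 4.0]
```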
## Basic Vector Operations

### Vector Addition

#### SSE (x86_64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]
b = StaticArray[5.0_f32, 6.0_f32, 7.0_f32, 8.0_f32]

def simd_vector_add(a : StaticArray(Float32, 4), b : StaticArray(Float32, 4)) : StaticArray(Float32, 4)
  result = uninitialized StaticArray(Float32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "movups ($1), %xmm0  // load vector a into xmm0
     movups ($2), %xmm1  // load vector b into xmm1
     addps %xmm1, %xmm0  // perform parallel addition of four 32-bit floats
     movups %xmm0, ($0)  // store result to memory"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "xmm0", "xmm1", "memory"
        : "volatile"
  )

  result
end

puts "Vector addition: #{simd_vector_add(a, b)}"
```
#### NEON (AArch64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]
b = StaticArray[5.0_f32, 6.0_f32, 7.0_f32, 8.0_f32]

def simd_vector_add(a : StaticArray(Float32, 4), b : StaticArray(Float32, 4)) : StaticArray(Float32, 4)
  result = uninitialized StaticArray(Float32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "ld1 {v0.4s}, [$1]        // load vector a
     ld1 {v1.4s}, [$2]        // load vector b
     fadd v2.4s, v0.4s, v1.4s // add each element
     st1 {v2.4s}, [$0]        // store the result"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "v0", "v1", "v2", "memory"
        : "volatile"
  )

  result
end

puts "Vector addition: #{simd_vector_add(a, b)}"
```
### Vector Multiplication

#### SSE (x86_64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]
b = StaticArray[5.0_f32, 6.0_f32, 7.0_f32, 8.0_f32]

def simd_vector_multiply(a : StaticArray(Float32, 4), b : StaticArray(Float32, 4)) : StaticArray(Float32, 4)
  result = uninitialized StaticArray(Float32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "movups ($1), %xmm0  // load vector a into xmm0
     movups ($2), %xmm1  // load vector b into xmm1
     mulps %xmm1, %xmm0  // perform parallel multiplication of four 32-bit floats
     movups %xmm0, ($0)  // store result to memory"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "xmm0", "xmm1", "memory"
        : "volatile"
  )

  result
end

puts "Vector multiplication: #{simd_vector_multiply(a, b)}"
```
#### NEON (AArch64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]
b = StaticArray[5.0_f32, 6.0_f32, 7.0_f32, 8.0_f32]

def simd_vector_multiply(a : StaticArray(Float32, 4), b : StaticArray(Float32, 4)) : StaticArray(Float32, 4)
  result = uninitialized StaticArray(Float32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "ld1 {v0.4s}, [$1]        // load vector a
     ld1 {v1.4s}, [$2]        // load vector b
     fmul v2.4s, v0.4s, v1.4s // multiply each element
     st1 {v2.4s}, [$0]        // store the result"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "v0", "v1", "v2", "memory"
        : "volatile"
  )

  result
end

puts "Vector multiplication: #{simd_vector_multiply(a, b)}"
```
## Aggregation Operations

### Vector Sum

#### SSE (x86_64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]

def simd_vector_sum(vec : StaticArray(Float32, 4)) : Float32
  result = uninitialized Float32
  vec_ptr = vec.to_unsafe
  result_ptr = pointerof(result)

  asm(
    "movups ($1), %xmm0  // load vector into xmm0
     haddps %xmm0, %xmm0 // horizontal add: [a+b, c+d, a+b, c+d]
     haddps %xmm0, %xmm0 // horizontal add again: [a+b+c+d, *, *, *]
     movss %xmm0, ($0)   // store the first element of result"
        :: "r"(result_ptr), "r"(vec_ptr)
        : "xmm0", "memory"
        : "volatile"
  )

  result
end

puts "Vector sum: #{simd_vector_sum(a)}"
```
#### NEON (AArch64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]

def simd_vector_sum(vec : StaticArray(Float32, 4)) : Float32
  result = uninitialized Float32
  vec_ptr = vec.to_unsafe
  result_ptr = pointerof(result)

  asm(
    "ld1 {v0.4s}, [$1]         // load vector
     faddp v1.4s, v0.4s, v0.4s // pairwise add: [a+b, c+d, a+b, c+d]
     faddp v2.2s, v1.2s, v1.2s // pairwise add again: [a+b+c+d, *]
     str s2, [$0]              // store the final sum"
        :: "r"(result_ptr), "r"(vec_ptr)
        : "v0", "v1", "v2", "memory"
        : "volatile"
  )

  result
end

puts "Vector sum: #{simd_vector_sum(a)}"
```
### Finding Maximum Value

#### SSE (x86_64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]

def simd_vector_max(vec : StaticArray(Float32, 4)) : Float32
  result = uninitialized Float32
  vec_ptr = vec.to_unsafe
  result_ptr = pointerof(result)

  asm(
    "movups ($1), %xmm0          // load vector into xmm0
     movaps %xmm0, %xmm1         // copy xmm0 to xmm1
     shufps $$0x4E, %xmm1, %xmm1 // swap upper and lower pairs
     maxps %xmm1, %xmm0          // compute max of each pair
     movaps %xmm0, %xmm1         // copy result to xmm1
     shufps $$0x01, %xmm1, %xmm1 // shuffle adjacent elements
     maxps %xmm1, %xmm0          // compute final max
     movss %xmm0, ($0)           // store the result"
        :: "r"(result_ptr), "r"(vec_ptr)
        : "xmm0", "xmm1", "memory"
        : "volatile"
  )

  result
end

puts "Vector max: #{simd_vector_max(a)}"
```
#### NEON (AArch64)

```crystal
a = StaticArray[1.0_f32, 2.0_f32, 3.0_f32, 4.0_f32]

def simd_vector_max(vec : StaticArray(Float32, 4)) : Float32
  result = uninitialized Float32
  vec_ptr = vec.to_unsafe
  result_ptr = pointerof(result)

  asm(
    "ld1 {v0.4s}, [$1]         // load vector
     fmaxp v1.4s, v0.4s, v0.4s // pairwise max: [max(a, b), max(c, d), ...]
     fmaxp v2.2s, v1.2s, v1.2s // final pairwise max
     str s2, [$0]              // store result"
        :: "r"(result_ptr), "r"(vec_ptr)
        : "v0", "v1", "v2", "memory"
        : "volatile"
  )

  result
end

puts "Vector max: #{simd_vector_max(a)}"
```
## Integer Operations

### Integer Addition

#### SSE (x86_64)

```crystal
int_a = StaticArray[1, 2, 3, 4]
int_b = StaticArray[10, 20, 30, 40]

def simd_int_add(a : StaticArray(Int32, 4), b : StaticArray(Int32, 4)) : StaticArray(Int32, 4)
  result = uninitialized StaticArray(Int32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "movdqu ($1), %xmm0  // load integer vector a into xmm0
     movdqu ($2), %xmm1  // load integer vector b into xmm1
     paddd %xmm1, %xmm0  // perform parallel addition of four 32-bit integers
     movdqu %xmm0, ($0)  // store result to memory"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "xmm0", "xmm1", "memory"
        : "volatile"
  )

  result
end

puts "Integer addition: #{simd_int_add(int_a, int_b)}"
```
#### NEON (AArch64)

```crystal
int_a = StaticArray[1, 2, 3, 4]
int_b = StaticArray[10, 20, 30, 40]

def simd_int_add(a : StaticArray(Int32, 4), b : StaticArray(Int32, 4)) : StaticArray(Int32, 4)
  result = uninitialized StaticArray(Int32, 4)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "ld1 {v0.4s}, [$1]       // load integer vector a
     ld1 {v1.4s}, [$2]       // load integer vector b
     add v2.4s, v0.4s, v1.4s // perform element-wise addition
     st1 {v2.4s}, [$0]       // store result to memory"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "v0", "v1", "v2", "memory"
        : "volatile"
  )

  result
end

puts "Integer addition: #{simd_int_add(int_a, int_b)}"
```
### Saturated Addition

#### SSE (x86_64)

```crystal
sat_a = StaticArray[29_000_i16, 30_000_i16, 31_000_i16, 32_000_i16,
                    32_000_i16, 32_000_i16, 32_000_i16, 32_000_i16]
sat_b = StaticArray[1_000_i16, 1_000_i16, 1_000_i16, 1_000_i16,
                    500_i16, 600_i16, 700_i16, 800_i16]

def simd_saturated_add(a : StaticArray(Int16, 8), b : StaticArray(Int16, 8)) : StaticArray(Int16, 8)
  result = uninitialized StaticArray(Int16, 8)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "movdqu ($1), %xmm0  // load 8 × 16-bit integers into xmm0
     movdqu ($2), %xmm1  // load 8 × 16-bit integers into xmm1
     paddsw %xmm1, %xmm0 // perform saturated addition
     movdqu %xmm0, ($0)  // store result to memory"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "xmm0", "xmm1", "memory"
        : "volatile"
  )

  result
end

puts "Saturated addition: #{simd_saturated_add(sat_a, sat_b)}"
```
#### NEON (AArch64)

```crystal
sat_a = StaticArray[29_000_i16, 30_000_i16, 31_000_i16, 32_000_i16,
                    32_000_i16, 32_000_i16, 32_000_i16, 32_000_i16]
sat_b = StaticArray[1_000_i16, 1_000_i16, 1_000_i16, 1_000_i16,
                    500_i16, 600_i16, 700_i16, 800_i16]

def simd_saturated_add(a : StaticArray(Int16, 8), b : StaticArray(Int16, 8)) : StaticArray(Int16, 8)
  result = uninitialized StaticArray(Int16, 8)
  a_ptr = a.to_unsafe
  b_ptr = b.to_unsafe
  result_ptr = result.to_unsafe

  asm(
    "ld1 {v0.8h}, [$1]         // load 8 × 16-bit integers from a into v0
     ld1 {v1.8h}, [$2]         // load 8 × 16-bit integers from b into v1
     sqadd v2.8h, v0.8h, v1.8h // perform saturated addition
     st1 {v2.8h}, [$0]         // store result to memory"
        :: "r"(result_ptr), "r"(a_ptr), "r"(b_ptr)
        : "v0", "v1", "v2", "memory"
        : "volatile"
  )

  result
end

puts "Saturated addition: #{simd_saturated_add(sat_a, sat_b)}"
```
## Examining LLVM-IR and Assembly

To inspect LLVM IR output:

```
crystal build your_file.cr --emit llvm-ir --no-debug
```

To inspect raw assembly:

```
crystal build your_file.cr --emit asm --no-debug
```

You'll see that your inline `asm` blocks are preserved as-is, even with optimizations (`-O3`).
For example, here is the NEON vector addition in the optimized IR; the `asm sideeffect` call survives untouched:

```llvm
__crystal_once.exit.i.i:                          ; preds = %else.i.i.i, %.noexc98
  call void @llvm.lifetime.start.p0(i64 16, ptr nonnull %path.i.i.i.i.i)
  call void @llvm.lifetime.start.p0(i64 16, ptr nonnull %obj1.i.i.i.i)
  call void @llvm.lifetime.start.p0(i64 16, ptr nonnull %b2.i.i.i)
  store <4 x float> <float 1.000000e+00, float 2.000000e+00, float 3.000000e+00, float 4.000000e+00>, ptr %obj1.i.i.i.i, align 16
  store <4 x float> <float 5.000000e+00, float 6.000000e+00, float 7.000000e+00, float 8.000000e+00>, ptr %b2.i.i.i, align 16
  call void asm sideeffect "ld1 {v0.4s}, [$1] \0Ald1 {v1.4s}, [$2] \0Afadd v2.4s, v0.4s, v1.4s \0Ast1 {v2.4s}, [$0]", "r,r,r,~{v0},~{v1},~{v2},~{memory}"(ptr nonnull %path.i.i.i.i.i, ptr nonnull %obj1.i.i.i.i, ptr nonnull %b2.i.i.i) #30
  %314 = load <4 x float>, ptr %path.i.i.i.i.i, align 16
  call void @llvm.lifetime.end.p0(i64 16, ptr nonnull %path.i.i.i.i.i)
  call void @llvm.lifetime.end.p0(i64 16, ptr nonnull %obj1.i.i.i.i)
  call void @llvm.lifetime.end.p0(i64 16, ptr nonnull %b2.i.i.i)
  %315 = invoke ptr @GC_malloc(i64 80)
          to label %.noexc100 unwind label %rescue2.loopexit.split-lp.loopexit.split-lp.loopexit.split-lp
```
And in the emitted AArch64 assembly, the inline block appears verbatim between the `InlineAsm` markers:

```asm
Lloh2300:
  ldr q1, [x9, lCPI312_43@PAGEOFF]
  add x8, sp, #164
  add x9, sp, #128
  str q0, [sp, #128]
  stur q1, [x29, #-128]
  ; InlineAsm Start
  ld1.4s { v0 }, [x9]
  ld1.4s { v1 }, [x10]
  fadd.4s v2, v0, v1
  st1.4s { v2 }, [x8]
  ; InlineAsm End
  ldr q0, [x25]
  str q0, [sp, #16]
```
## Miscellaneous

When using SIMD together with parallelism, memory bandwidth can become the bottleneck.
Although Crystal currently runs single-threaded by default, true parallelism is in progress, and memory bandwidth limits may become relevant in the future.
## Conclusion

We've explored how to write SIMD operations in Crystal using inline `asm`, and examined how those instructions are lowered into LLVM IR and eventually into assembly.
This was a deep dive into low-level Crystal.
## Appendix: SIMD Instruction Reference

### SSE (x86_64)

| Instruction | Description |
|---|---|
| `movups` | Load/store 4 × Float32 (unaligned) |
| `movaps` | Load/store 4 × Float32 (aligned) |
| `movdqu` | Load/store 4 × Int32 or 8 × Int16 (unaligned) |
| `movss` | Move scalar Float32 (lowest lane) |
| `addps` | Add 4 × Float32 |
| `mulps` | Multiply 4 × Float32 |
| `paddd` | Add 4 × Int32 |
| `paddsw` | Saturated add 8 × Int16 |
| `haddps` | Horizontal add of Float32 pairs |
| `maxps` | Element-wise max (Float32) |
| `shufps` | Shuffle Float32 lanes (for reduction) |
### NEON (AArch64)

| Instruction | Description |
|---|---|
| `ld1` | Load vector (e.g. `v0.4s`, `v0.8h`) |
| `st1` | Store vector |
| `add` | Add 4 × Int32 |
| `sqadd` | Saturated add 8 × Int16 |
| `fadd` | Add 4 × Float32 |
| `fmul` | Multiply 4 × Float32 |
| `faddp` | Pairwise add (Float32 reduction) |
| `fmaxp` | Pairwise max (Float32 reduction) |
| `faddv` | Vector-wide add (SVE only; base NEON reduces with repeated `faddp`) |
| `fmaxv` | Vector-wide max (single-instruction alternative to `fmaxp`) |
### Notes

- SSE's `movaps` and `movdqa` require 16-byte alignment.
- NEON's `faddp` and `fmaxp` reduce in two steps: 4 → 2 → 1.
- `shufps` takes an immediate mask of four 2-bit lane selectors: `0x4E` (`0b01_00_11_10`) swaps the upper and lower 64-bit halves, and `0x01` (`0b00_00_00_01`) moves lane 1 into lane 0, which is how the max reduction reorders lanes.
- Saturated arithmetic (`paddsw`, `sqadd`) clamps values on overflow instead of wrapping.
Thanks for reading — and happy crystaling! 💎