Syed Mannan Saood

Posted on May 26

RISC-V Vector Extension (RVV): SIMD for the Open ISA

#architecture #simd #riscv #vectorprocessing

TL;DR: RISC-V’s Vector Extension (RVV) brings length-agnostic SIMD to the open ISA. Unlike x86’s fixed-width AVX or ARM’s NEON, RVV uses a variable-length vector model where software writes to abstract vector registers, and hardware executes with any physical width. This enables code portability across implementations—from tiny embedded cores to massive supercomputers—without recompilation. RVV 1.0 is ratified, shipping in real silicon, and positioned to dominate edge AI, HPC, and custom accelerators.

The SIMD Landscape Problem

Modern processors need SIMD (Single Instruction Multiple Data) for performance. Processing one data element per instruction is too slow for:

Image/video processing
Machine learning inference
Scientific computing
Signal processing
Compression/encryption

Every major architecture has SIMD extensions:

x86: SSE → AVX → AVX-512 (128-bit → 256-bit → 512-bit)
ARM: NEON (128-bit) → SVE/SVE2 (variable, 128-2048 bits)
RISC-V: RVV (variable, vector-length agnostic)

But there’s a fundamental problem with how x86 and early ARM approached this.

The x86 SIMD Evolution Disaster

The Compatibility Nightmare

x86’s SIMD history:

1999: SSE (128-bit, 4 × FP32)
      __m128 vec = _mm_add_ps(a, b);

2011: AVX (256-bit, 8 × FP32)  
      __m256 vec = _mm256_add_ps(a, b);  // New instruction!

2017: AVX-512 (512-bit, 16 × FP32)
      __m512 vec = _mm512_add_ps(a, b);  // Yet another instruction!

The problem: Each generation requires completely new instructions.

Code compiled for AVX-512:

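#include <immintrin.h>

// Squares data[] in place. Assumes n is a multiple of 16 (no tail handling).
void process_avx512(float* data, int n) {
    for (int i = 0; i < n; i += 16) {
        __m512 vec = _mm512_loadu_ps(&data[i]);   // Load 16 floats
        vec = _mm512_mul_ps(vec, vec);            // Square each lane
        _mm512_storeu_ps(&data[i], vec);          // Store 16 floats
    }
}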

Won’t run on AVX2 processors. Different width = different code.

Result:

Libraries ship multiple code paths (SSE, AVX, AVX-512)
Runtime detection needed (CPUID checks)
Binary bloat (each hot function compiled three or four times)
Maintenance nightmare

Production example (FFmpeg):

// Illustrative dispatch in the style of FFmpeg (the ff_process_* names are hypothetical)
if (cpu_flags & AV_CPU_FLAG_AVX512) {
    ff_process_avx512(data, n);
} else if (cpu_flags & AV_CPU_FLAG_AVX2) {
    ff_process_avx2(data, n);
} else if (cpu_flags & AV_CPU_FLAG_SSE4) {
    ff_process_sse4(data, n);
} else {
    ff_process_scalar(data, n);
}

Every function duplicated 4 times!

The Market Fragmentation

x86 processors in 2025:

Low-power laptops: 128-bit SIMD only
Desktop CPUs: 256-bit AVX2
High-end servers: 512-bit AVX-512
Some servers: AVX-512 disabled (heat/cost)

Your optimized AVX-512 code? It runs on only a minority of the x86 CPUs in the field.

ARM SVE: The Right Idea, Complex Execution

ARM learned from x86’s mistakes with Scalable Vector Extension (SVE).

SVE’s Variable-Length Model

// SVE code - vector length agnostic!
svfloat32_t vec = svld1_f32(pg, &data[i]);
vec = svmul_f32_z(pg, vec, vec);
svst1_f32(pg, &data[i], vec);

Key innovation: Same code runs on 128-bit, 256-bit, 512-bit, or 2048-bit hardware.

How: Predication and variable-length registers.

But SVE Has Issues

Complexity:

Complex predicate registers
Steep learning curve
Limited compiler support initially
ARM-specific (vendor lock-in)

Adoption:

Fujitsu A64FX (HPC): 512-bit SVE
AWS Graviton3: 256-bit SVE
Consumer ARM: Still mostly NEON

Market fragmentation: Different ARM vendors choose different widths.

RISC-V’s Solution: RVV

RISC-V Vector Extension takes SVE’s length-agnostic concept and simplifies it.

Core Philosophy

Write once, run anywhere—regardless of hardware vector width.

Software writes:     Hardware executes:
┌──────────────┐    ┌──────────────┐
│ vadd.vv v1,  │    │ 128-bit impl │
│   v2, v3     │ → │ 256-bit impl │
│              │    │ 512-bit impl │
└──────────────┘    │ 1024-bit impl│
                    └──────────────┘

All execute the same binary. No recompilation needed.

Vector Register Model

32 vector registers: v0-v31

Key concept: Each register has a logical length independent of physical width.

Logical view (programmer sees):
v1 = [0, 1, 2, 3, ..., VL-1]  (VL = vector length)

Physical implementations (register width VLEN, at LMUL=1):
128-bit: Holds 4 FP32 per register
256-bit: Holds 8 FP32 per register
512-bit: Holds 16 FP32 per register

Same instruction, different throughput.

Application Vector Length (AVL)

The key abstraction:

# Request to process 100 elements
li a0, 100           # Application vector length (AVL)
vsetvli t0, a0, e32, m1, ta, ma  # Set vector length, element width = 32 bits, LMUL = 1

# t0 now contains actual VL (hardware-dependent)
# On 128-bit: VL = 4 (processes 4 × FP32)
# On 512-bit: VL = 16 (processes 16 × FP32)

Loop automatically adapts:

process_loop:
    vsetvli t0, a0, e32, m1, ta, ma  # Get VL for remaining elements
    vle32.v v1, (a1)        # Load VL elements
    vadd.vv v1, v1, v2      # Add VL elements
    vse32.v v1, (a1)        # Store VL elements

    sub a0, a0, t0          # Remaining -= VL
    slli t1, t0, 2          # Advance pointer by VL*4 bytes
    add a1, a1, t1
    bnez a0, process_loop   # Loop if elements remain

Beautiful: Same code works on any vector width. Hardware fills VL appropriately.
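
The strip-mining loop above can be modelled in plain C to see how the iteration count changes with the hardware while the result stays the same. This is a minimal sketch, not hardware-accurate code: `model_vsetvl`, `add_stripmined`, and the `vlen_bits` parameter are names invented for illustration, and the model always returns min(AVL, VLMAX), whereas the specification gives implementations some freedom when AVL lies between VLMAX and 2×VLMAX.

```c
#include <stddef.h>

/* Simplified model of vsetvli: grant min(AVL, VLMAX) elements. */
static size_t model_vsetvl(size_t avl, size_t vlmax) {
    return avl < vlmax ? avl : vlmax;
}

/* Strip-mined in-place add: data[i] += addend[i] for i in [0, n).
   vlen_bits stands in for the hardware register width (hypothetical knob).
   Returns the number of loop iterations, i.e. vector instruction groups. */
static size_t add_stripmined(float *data, const float *addend,
                             size_t n, size_t vlen_bits) {
    size_t vlmax = vlen_bits / 32;            /* e32 elements, LMUL = 1 */
    size_t iterations = 0;
    while (n > 0) {
        size_t vl = model_vsetvl(n, vlmax);   /* vsetvli t0, a0, e32, m1 */
        for (size_t i = 0; i < vl; i++)       /* vle32 / vadd / vse32 */
            data[i] += addend[i];
        data += vl;                           /* advance pointers by VL */
        addend += vl;
        n -= vl;                              /* remaining -= VL */
        iterations++;
    }
    return iterations;
}
```

With 100 elements, the model takes 25 iterations at 128 bits and 7 at 512 bits; the loop body itself never changes.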

RVV Architecture Deep-Dive

Vector Configuration (vsetvl)

Three parameters control vector execution:

vsetvli rd, rs1, vtypei

rd:  Destination (receives actual VL)
rs1: Application vector length (AVL)
vtypei: Vector type (element width, LMUL, tail and mask policy)

vtypei encoding:

vtype bits: vma [7] | vta [6] | vsew [5:3] | vlmul [2:0]

vsew: Element width
  e8:  8-bit elements
  e16: 16-bit elements
  e32: 32-bit elements
  e64: 64-bit elements

vlmul: Logical register grouping
  m1: Use 1 register
  m2: Use 2 registers as one (2× capacity)
  m4: Use 4 registers
  m8: Use 8 registers

vta: Tail agnostic (elements past VL may be overwritten; "tu" preserves them)
vma: Mask agnostic (masked-off elements may be overwritten; "mu" preserves them)

Example:

vsetvli t0, a0, e32, m1, ta, ma
#              │   │   │   │   └─ Mask agnostic
#              │   │   │   └───── Tail agnostic  
#              │   │   └───────── LMUL = 1 register
#              │   └───────────── Element size = 32 bits
#              └───────────────── AVL from a0
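
To make the field layout concrete, here is a small sketch that packs the four settings into the low 8 bits of `vtype`, following the RVV 1.0 field positions (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7). The enum names and `encode_vtype` are invented for this example; the assembler normally does this packing for you.

```c
#include <stdint.h>

/* vsew field values (bits 5:3) */
enum { SEW_E8 = 0, SEW_E16 = 1, SEW_E32 = 2, SEW_E64 = 3 };

/* vlmul field values (bits 2:0); the value 4 is reserved */
enum { LMUL_M1 = 0, LMUL_M2 = 1, LMUL_M4 = 2, LMUL_M8 = 3,
       LMUL_MF8 = 5, LMUL_MF4 = 6, LMUL_MF2 = 7 };

/* Pack the settings into the low 8 bits of vtype. */
static uint32_t encode_vtype(unsigned vsew, unsigned vlmul, int ta, int ma) {
    return ((uint32_t)(ma != 0) << 7)   /* vma:   bit 7    */
         | ((uint32_t)(ta != 0) << 6)   /* vta:   bit 6    */
         | ((vsew  & 7u) << 3)          /* vsew:  bits 5:3 */
         |  (vlmul & 7u);               /* vlmul: bits 2:0 */
}
```

Under this layout, `e32, m1, ta, ma` encodes as 0xD0.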

LMUL: Register Grouping

Problem: Processing wide data types or increasing throughput.

Solution: Group registers together.

LMUL=1 (m1):
v1 = single register

LMUL=2 (m2):  
v2 = {v2, v3} grouped as one logical register (2× capacity)

LMUL=4 (m4):
v4 = {v4, v5, v6, v7} (4× capacity)

LMUL=8 (m8):
v8 = {v8, v9, ..., v15} (8× capacity)

Use case:

# Process 64-bit doubles, need more capacity
vsetvli t0, a0, e64, m2, ta, ma  # Use register pairs
vle64.v v2, (a1)                  # Loads into v2+v3
vfmul.vv v2, v2, v4               # Multiply (v2,v3) × (v4,v5)
vse64.v v2, (a1)                  # Store from v2+v3

Trade-off: More capacity, fewer independent vectors.

Fractional LMUL

For narrow elements, especially in mixed-width code:

LMUL=1/2 (mf2): Use half a register
LMUL=1/4 (mf4): Use quarter register  
LMUL=1/8 (mf8): Use eighth register

Use case:

# Process 8-bit pixels efficiently
vsetvli t0, a0, e8, mf2, ta, ma  # 8-bit elements, half register
vle8.v v1, (a1)                   # Load pixels
vadd.vi v1, v1, 5                 # Add constant
vse8.v v1, (a1)                   # Store

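Benefit: Narrow operands occupy only part of a register, so their widened results still fit in a single register instead of a register group, leaving more registers free.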
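
The capacity trade-off follows from one formula: VLMAX = VLEN / SEW × LMUL. A minimal sketch, with LMUL passed as a fraction so that mf2, mf4, and mf8 can be expressed; `vlmax` and `register_groups` are illustrative helper names, not part of any toolchain:

```c
#include <stddef.h>

/* Elements per register group: VLMAX = VLEN / SEW * LMUL,
   with LMUL given as the fraction lmul_num / lmul_den. */
static size_t vlmax(size_t vlen_bits, size_t sew_bits,
                    size_t lmul_num, size_t lmul_den) {
    return (vlen_bits * lmul_num) / (sew_bits * lmul_den);
}

/* Independently addressable register groups out of v0-v31.
   Fractional LMUL still uses one register name per operand, so it stays at 32. */
static size_t register_groups(size_t lmul_num, size_t lmul_den) {
    return lmul_num > lmul_den ? (32 * lmul_den) / lmul_num : 32;
}
```

For example, on a 128-bit implementation `e64, m2` gives 4 elements per group with 16 groups available, while `e8, mf2` gives 8 elements per operand with all 32 registers still usable.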

Vector Instruction Categories

1. Configuration

vsetvli rd, rs1, vtypei    # Set VL by AVL
vsetivli rd, uimm, vtypei  # Set VL by immediate
vsetvl rd, rs1, rs2        # Set VL, type from register

2. Load/Store

Unit-stride (contiguous):

vle32.v v1, (a0)     # Load 32-bit elements
vse32.v v1, (a0)     # Store 32-bit elements

Strided (fixed stride):

vlse32.v v1, (a0), a1  # Load with stride a1
vsse32.v v1, (a0), a1  # Store with stride a1

Indexed (gather/scatter):

vluxei32.v v1, (a0), v2  # Load indexed by byte offsets in v2 (unordered)
vsuxei32.v v1, (a0), v2  # Store indexed by byte offsets in v2 (unordered)

Segment (array-of-structures in memory, one field per register):

vlseg3e32.v v1, (a0)  # Load 3-element structures
                      # v1 = {x0, x1, x2, ...}
                      # v2 = {y0, y1, y2, ...}
                      # v3 = {z0, z1, z2, ...}
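
The addressing modes are easiest to compare as scalar reference loops. The sketch below models a strided load, an indexed load (spelled `vluxei32.v` in RVV 1.0, with indices interpreted as byte offsets), and a three-field segment load. The `model_*` functions are illustrative stand-ins that ignore masking and faults.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Strided load (vlse32.v): element i comes from base + i * stride_bytes. */
static void model_vlse32(uint32_t *vd, const uint8_t *base,
                         ptrdiff_t stride_bytes, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        memcpy(&vd[i], base + (ptrdiff_t)i * stride_bytes, sizeof vd[i]);
}

/* Indexed load / gather (vluxei32.v): element i comes from base + offsets[i]. */
static void model_vluxei32(uint32_t *vd, const uint8_t *base,
                           const uint32_t *byte_offsets, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        memcpy(&vd[i], base + byte_offsets[i], sizeof vd[i]);
}

/* Segment load (vlseg3e32.v): de-interleave {x, y, z} records from memory
   into three separate destination vectors. */
static void model_vlseg3e32(uint32_t *vx, uint32_t *vy, uint32_t *vz,
                            const uint32_t *base, size_t vl) {
    for (size_t i = 0; i < vl; i++) {
        vx[i] = base[3 * i + 0];
        vy[i] = base[3 * i + 1];
        vz[i] = base[3 * i + 2];
    }
}
```

A strided load with a 12-byte stride over packed {x, y, z} words fetches only the x fields; the segment load fetches all three fields in one instruction.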

3. Arithmetic

Integer:

vadd.vv v1, v2, v3     # Vector + vector
vadd.vx v1, v2, a0     # Vector + scalar
vadd.vi v1, v2, 5      # Vector + immediate
vsub.vv v1, v2, v3     # Subtract
vmul.vv v1, v2, v3     # Multiply
vdiv.vv v1, v2, v3     # Divide

Floating-point:

vfadd.vv v1, v2, v3    # FP add
vfmul.vv v1, v2, v3    # FP multiply
vfmacc.vv v1, v2, v3   # FP fused multiply-accumulate: v1 = v1 + v2*v3
vfmadd.vv v1, v2, v3   # FP fused multiply-add: v1 = v1*v2 + v3
vfdiv.vv v1, v2, v3    # FP divide
vfsqrt.v v1, v2        # FP square root

Widening operations:

vwmul.vv v4, v1, v2    # Widening multiply e32 → e64
                       # v1, v2 hold 32-bit elements
                       # v4 (register pair v4-v5 at LMUL=1) holds the 64-bit results

4. Logical/Shift

vand.vv v1, v2, v3     # Bitwise AND
vor.vv v1, v2, v3      # Bitwise OR
vxor.vv v1, v2, v3     # Bitwise XOR
vsll.vv v1, v2, v3     # Shift left logical
vsra.vv v1, v2, v3     # Shift right arithmetic

5. Comparison & Masking

vmseq.vv v0, v1, v2    # Set mask: v1 == v2
vmslt.vv v0, v1, v2    # Set mask: v1 < v2
vmsle.vv v0, v1, v2    # Set mask: v1 <= v2

# Use mask in operations
vadd.vv v3, v1, v2, v0.t  # Add only where mask is true
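
As a scalar reference for what the compare and the masked add do, here is a minimal sketch. It stores one byte per mask bit for readability (hardware packs one bit per element into v0) and assumes the mask-undisturbed policy, under which inactive elements keep their old value; with the mask-agnostic policy they may instead be overwritten. The `model_*` names are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of vmslt.vv: mask[i] = 1 where a[i] < b[i] (signed compare). */
static void model_vmslt(uint8_t *mask, const int32_t *a,
                        const int32_t *b, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        mask[i] = (uint8_t)(a[i] < b[i]);
}

/* Model of vadd.vv vd, a, b, v0.t with mask-undisturbed policy:
   only active elements are written; the add wraps like the hardware's. */
static void model_vadd_masked(int32_t *vd, const int32_t *a, const int32_t *b,
                              const uint8_t *mask, size_t vl) {
    for (size_t i = 0; i < vl; i++)
        if (mask[i])
            vd[i] = (int32_t)((uint32_t)a[i] + (uint32_t)b[i]);
}
```

This is how a vectorized loop handles an `if` in its body: compute the condition for every element, then let the mask decide which results are kept.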

6. Permutations

vslideup.vi v1, v2, 5   # Slide up by 5 positions
vslidedown.vi v1, v2, 3 # Slide down by 3 positions
vrgather.vv v1, v2, v3  # Gather elements by index

7. Reductions

vredsum.vs v3, v1, v2   # Sum reduction
                        # v3[0] = v2[0] + sum(v1)
vredmax.vs v3, v1, v2   # Max reduction
vredmin.vs v3, v1, v2   # Min reduction
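
The operand roles in a reduction are easy to mix up, so here is a scalar sketch of `vredsum.vs vd, vs2, vs1` and `vredmax.vs`: the vector operand `vs2` is folded into a seed taken from element 0 of `vs1`, and the result lands in element 0 of `vd`. The `model_*` functions are illustrative and assume vl > 0 (with vl = 0 the hardware leaves `vd` untouched).

```c
#include <stddef.h>
#include <stdint.h>

/* Model of vredsum.vs: returns the value written to vd[0].
   Unsigned arithmetic mirrors the wrapping integer add. */
static uint32_t model_vredsum(const uint32_t *vs2, uint32_t vs1_elem0, size_t vl) {
    uint32_t acc = vs1_elem0;            /* seed comes from vs1[0] */
    for (size_t i = 0; i < vl; i++)
        acc += vs2[i];
    return acc;
}

/* Model of vredmax.vs (signed maximum), seeded the same way. */
static int32_t model_vredmax(const int32_t *vs2, int32_t vs1_elem0, size_t vl) {
    int32_t acc = vs1_elem0;
    for (size_t i = 0; i < vl; i++)
        if (vs2[i] > acc)
            acc = vs2[i];
    return acc;
}
```

The seed lets a loop carry a running total across iterations without an extra scalar add.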

Code Examples

Example 1: SAXPY (y = a*x + y)

C code:

void saxpy(float a, float* x, float* y, int n) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

RISC-V RVV assembly:

saxpy:                                    # fa0 = a, a0 = x, a1 = y, a2 = n
loop:
    vsetvli t0, a2, e32, m1, ta, ma      # VL = min(remaining, VLMAX)
    vle32.v v0, (a0)                      # Load x[i:i+VL]
    vle32.v v1, (a1)                      # Load y[i:i+VL]
    vfmacc.vf v1, fa0, v0                 # v1 = v1 + a * v0
    vse32.v v1, (a1)                      # Store y[i:i+VL]

    sub a2, a2, t0                        # Remaining -= VL
    slli t1, t0, 2                        # Offset = VL * 4 bytes
    add a0, a0, t1                        # x += offset
    add a1, a1, t1                        # y += offset
    bnez a2, loop                         # Loop if remaining > 0

    ret

Portable: Works on 128-bit, 256-bit, 512-bit, 1024-bit implementations.

Example 2: Dot Product

C code:

float dot_product(float* a, float* b, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

RVV assembly:

dot_product:                              # a0 = a, a1 = b, a2 = n
    vsetvli t0, zero, e32, m1, ta, ma     # rs1 = zero, rd != zero: VL = VLMAX
    vmv.v.i v2, 0                         # v2 = per-lane accumulators = 0

loop:
    vsetvli t0, a2, e32, m1, tu, ma       # tu: lanes past VL keep their partial sums
    vle32.v v0, (a0)                      # Load a[i:i+VL]
    vle32.v v1, (a1)                      # Load b[i:i+VL]
    vfmacc.vv v2, v0, v1                  # v2 += v0 * v1

    sub a2, a2, t0
    slli t1, t0, 2
    add a0, a0, t1
    add a1, a1, t1
    bnez a2, loop

    # Reduce v2 to scalar
    vsetvli t0, zero, e32, m1, ta, ma     # Back to VLMAX so every lane is summed
    vmv.s.x v3, zero                      # v3[0] = 0.0 (all-zero bit pattern)
    vfredusum.vs v3, v2, v3               # v3[0] = sum(v2)
    vfmv.f.s fa0, v3                      # Return in fa0

    ret
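
The accumulate-then-reduce structure of this routine can be checked against a plain C model. The sketch below keeps one partial sum per lane and reduces once at the end; `LANES` is a hypothetical VLMAX (4 matches e32 on a 128-bit implementation). The point to notice is the final short iteration: it touches only the first `vl` lanes, and the remaining lanes must keep their partial sums, which in RVV terms means the accumulator needs the tail-undisturbed policy and the final reduction must run over all lanes.

```c
#include <stddef.h>

#define LANES 4   /* hypothetical VLMAX for e32, LMUL = 1, VLEN = 128 */

/* Dot product using per-lane accumulators and a single final reduction. */
static float dot_lanes(const float *a, const float *b, size_t n) {
    float acc[LANES] = {0.0f};               /* vmv.v.i v2, 0 at VLMAX */
    while (n > 0) {
        size_t vl = n < LANES ? n : LANES;   /* vsetvli */
        for (size_t i = 0; i < vl; i++)      /* vfmacc.vv on lanes [0, vl) */
            acc[i] += a[i] * b[i];
        /* lanes [vl, LANES) are left untouched on purpose */
        a += vl;
        b += vl;
        n -= vl;
    }
    float sum = 0.0f;                        /* vfredusum.vs over all lanes */
    for (size_t i = 0; i < LANES; i++)
        sum += acc[i];
    return sum;
}
```

Because the lanes are summed in a different order than the scalar loop, results can differ in the last bits for general floating-point inputs.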

Example 3: RGB to Grayscale

C code:

void rgb_to_gray(uint8_t* rgb, uint8_t* gray, int pixels) {
    for (int i = 0; i < pixels; i++) {
        uint8_t r = rgb[i*3 + 0];
        uint8_t g = rgb[i*3 + 1];
        uint8_t b = rgb[i*3 + 2];
        gray[i] = (r * 77 + g * 150 + b * 29) >> 8;
    }
}

RVV assembly (simplified):

rgb_to_gray:                   # a0 = rgb, a1 = gray, a2 = pixels
    li t3, 77                  # The .vx forms take a scalar register,
    li t4, 150                 # not an immediate, so load the weights first
    li t5, 29

loop:
    vsetvli t0, a2, e8, m1, ta, ma
    vlseg3e8.v v0, (a0)       # Load R,G,B into v0,v1,v2
                               # v0 = {r0, r1, r2, ...}
                               # v1 = {g0, g1, g2, ...}
                               # v2 = {b0, b1, b2, ...}

    # Widen to 16-bit for multiplication (results occupy v4-v5)
    vwmulu.vx v4, v0, t3      # v4 = r * 77 (16-bit)
    vwmaccu.vx v4, t4, v1     # v4 += g * 150 (scalar operand comes first)
    vwmaccu.vx v4, t5, v2     # v4 += b * 29

    # Shift right by 8, narrow to 8-bit
    vnsrl.wi v3, v4, 8        # v3 = v4 >> 8 (narrow to 8-bit)

    vse8.v v3, (a1)           # Store grayscale

    sub a2, a2, t0
    li t1, 3
    mul t2, t0, t1            # RGB offset = VL * 3
    add a0, a0, t2
    add a1, a1, t0
    bnez a2, loop

    ret
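
The constants work because 77 + 150 + 29 = 256, so the weighted sum never exceeds 255 × 256 = 65280 and always fits in 16 bits, which is why a single widening step is enough. The weights approximate the usual luma coefficients (77/256 ≈ 0.301, 150/256 ≈ 0.586, 29/256 ≈ 0.113). A scalar sketch of the per-pixel arithmetic, with `gray_fixed` as an illustrative name:

```c
#include <stdint.h>

/* Fixed-point grayscale for one pixel, matching the vector code's arithmetic:
   16-bit weighted sum, then a right shift by 8 back to 8 bits. */
static uint8_t gray_fixed(uint8_t r, uint8_t g, uint8_t b) {
    uint16_t acc = (uint16_t)(r * 77 + g * 150 + b * 29);  /* max 65280 */
    return (uint8_t)(acc >> 8);
}
```

The shift truncates rather than rounds, so pure red, green, and blue map to 76, 149, and 28, and white maps to 255.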

Compiler Support

GCC Intrinsics

RVV intrinsics follow a pattern:

#include <riscv_vector.h>

// Naming: __riscv_v<op>_<operands>_<type><lmul>
// (older toolchains also accepted these names without the __riscv_ prefix)
vfloat32m1_t __riscv_vfadd_vv_f32m1(vfloat32m1_t vs2,
                                     vfloat32m1_t vs1,
                                     size_t vl);

Example: SAXPY

void saxpy_rvv(float a, const float* x, float* y, size_t n) {
    size_t vl;
    for (size_t i = 0; i < n; i += vl) {
        vl = __riscv_vsetvl_e32m1(n - i);                       // Set VL
        vfloat32m1_t vx = __riscv_vle32_v_f32m1(x + i, vl);     // Load x
        vfloat32m1_t vy = __riscv_vle32_v_f32m1(y + i, vl);     // Load y
        vy = __riscv_vfmacc_vf_f32m1(vy, a, vx, vl);            // y += a*x
        __riscv_vse32_v_f32m1(y + i, vy, vl);                   // Store y
    }
}

Auto-Vectorization

Modern compilers can auto-vectorize:

void add_arrays(float* a, float* b, float* c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

GCC with -march=rv64gcv -O3:

Generates RVV vector instructions automatically!

Works best with:

Simple loops
No loop-carried dependencies
Aligned data
Hint with pragmas if needed
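
One hint that often helps is telling the compiler the arrays do not overlap. A minimal sketch using the C99 `restrict` qualifier; `add_arrays_restrict` is a name made up here, and whether a given compiler version vectorizes the loop still depends on its cost model:

```c
#include <stddef.h>

/* restrict promises that a, b, and c never alias, so the compiler does not
   need a runtime overlap check before vectorizing the loop. */
void add_arrays_restrict(const float *restrict a, const float *restrict b,
                         float *restrict c, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Built with `-march=rv64gcv -O3`, GCC's `-fopt-info-vec` option reports which loops were vectorized, which is a quick way to confirm the hint had an effect.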

Performance Analysis

Theoretical Speedup

Scalar code (1 FP32/cycle):

1000 elements → 1000 cycles

128-bit RVV (4 FP32/cycle):

1000 elements → 250 cycles (4× speedup)

256-bit RVV (8 FP32/cycle):

1000 elements → 125 cycles (8× speedup)

512-bit RVV (16 FP32/cycle):

1000 elements → 63 cycles (16× speedup)
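
These figures come from a simple ceiling division. The sketch below reproduces them under the same idealized assumptions: one full-width vector operation issues per cycle, the execution datapath is as wide as the register, and memory never stalls. `ideal_cycles` is an illustrative helper, not a performance model of any real core.

```c
#include <stddef.h>

/* Idealized cycle count: ceil(n_elements / lanes),
   where lanes = datapath width / element width. */
static size_t ideal_cycles(size_t n_elements, size_t datapath_bits,
                           size_t sew_bits) {
    size_t lanes = datapath_bits / sew_bits;
    return (n_elements + lanes - 1) / lanes;
}
```

The 512-bit case rounds 62.5 up to 63 cycles because the last pass processes only 8 of its 16 lanes, so the effective speedup is slightly under 16×.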

Same binary. Different hardware, different throughput.

Real-World Benchmarks

Matrix multiplication (GEMM), idealized figures that assume throughput scales linearly with vector width:

Implementation	Performance (GFLOPS)
Scalar C	0.8
RVV (128-bit)	3.2 (4× speedup)
RVV (256-bit)	6.4 (8× speedup)
RVV (512-bit)	12.8 (16× speedup)

Image convolution:

Filter Size	Scalar	RVV 128-bit	RVV 256-bit
3×3	45ms	12ms (3.75×)	6ms (7.5×)
5×5	120ms	32ms (3.75×)	16ms (7.5×)

With good algorithm design, results can approach the theoretical speedup; memory bandwidth usually decides how close.

Hardware Implementations

Commercial Silicon (2025)

Alibaba T-Head:

XuanTie C910: 128-bit RVV 0.7.1
XuanTie C920: 128-bit RVV 0.7.1 (the later C920v2 implements RVV 1.0)

SiFive:

P670: 128-bit RVV 1.0
X280: 512-bit RVV 1.0 (AI/ML-focused)

Andes:

AX65: 128-bit RVV 1.0

SpacemiT:

K1: 256-bit RVV 1.0 (8-core, consumer SBC)

VLEN (Vector Register Length)

Common implementations:

VLEN	FP32 Elements	Target Market
128-bit	4	Embedded, IoT
256-bit	8	General purpose, edge AI
512-bit	16	HPC, servers
1024-bit	32	Supercomputing

All run the same binaries.

RVV vs ARM SVE vs x86 AVX

Code Portability

RVV:

// One code path, works on all VLEN
vfloat32m1_t v = __riscv_vfadd_vv_f32m1(a, b, vl);

ARM SVE:

// One code path, works on all SVE lengths
svfloat32_t v = svadd_f32_z(pg, a, b);

x86 AVX:

// Different code per width
#ifdef __AVX512F__
    __m512 v = _mm512_add_ps(a, b);  // 512-bit
#elif __AVX2__
    __m256 v = _mm256_add_ps(a, b);  // 256-bit
#else
    __m128 v = _mm_add_ps(a, b);     // 128-bit
#endif

Winner: RVV and SVE (length-agnostic)

Simplicity

RVV:

Simple mask model (single mask register v0)
Straightforward vsetvl configuration
32 vector registers

SVE:

Complex predicate registers (p0-p15)
Governing predicates + first-fault loads
32 vector registers + 16 predicates

x86 AVX:

No length abstraction
Different instruction sets per width
Mask registers (AVX-512) add complexity

Winner: RVV (simpler model)

Ecosystem

x86 AVX:

Mature compiler support
Extensive libraries
Decades of optimization

ARM SVE:

Growing compiler support
ARM-specific (vendor lock)
Limited consumer hardware

RVV:

Compiler support improving rapidly
Open standard (no vendor lock-in)
Growing hardware ecosystem

Winner: x86 (today), RVV (trajectory)

Key Takeaways

1. Length-agnostic is the right model

One binary, any vector width
Future-proof code
Hardware flexibility

2. Simpler than ARM SVE

Easier to learn and use
Straightforward mask model
Good compiler target

3. Open standard advantage

No vendor lock-in
Custom extensions possible
Growing ecosystem

4. Not a drop-in x86 replacement (yet)

Ecosystem still maturing
Limited consumer hardware
But trajectory is strong

5. Ideal for specialized domains

Edge AI (custom VLEN for models)
HPC (large VLEN for throughput)
Embedded (small VLEN for power)

Getting Started with RVV

Emulation

QEMU:

# Run a user-mode binary with the V extension enabled and VLEN = 256 bits
qemu-riscv64 -cpu rv64,v=true,vlen=256 ./my_rvv_program

Spike (RISC-V ISA Simulator):

# pk (the RISC-V proxy kernel) provides system calls for user-mode programs
spike --isa=rv64gcv pk ./my_rvv_program

Development Boards

SpacemiT K1:

8-core RISC-V
256-bit RVV 1.0
Linux support
~$100

SiFive HiFive Unmatched:

U74 cores (no vector unit)
RVV code on this board has to run under an emulator

Cross-Compilation

GCC toolchain:

riscv64-unknown-linux-gnu-gcc \
    -march=rv64gcv \
    -O3 \
    -o program \
    program.c

Intrinsics example:

#include <riscv_vector.h>

void vector_add(const float* a, const float* b, float* c, size_t n) {
    size_t vl;
    for (size_t i = 0; i < n; i += vl) {
        vl = __riscv_vsetvl_e32m1(n - i);                        // Elements this pass
        vfloat32m1_t va = __riscv_vle32_v_f32m1(&a[i], vl);      // Load a
        vfloat32m1_t vb = __riscv_vle32_v_f32m1(&b[i], vl);      // Load b
        vfloat32m1_t vc = __riscv_vfadd_vv_f32m1(va, vb, vl);    // c = a + b
        __riscv_vse32_v_f32m1(&c[i], vc, vl);                    // Store c
    }
}

Conclusion

RISC-V Vector Extension brings length-agnostic SIMD to the open ISA ecosystem. By learning from x86’s fixed-width mistakes and ARM SVE’s complexity, RVV offers:

Portable code across any vector width
Simpler programming model
Open standard flexibility
Growing hardware and software ecosystem

While still maturing compared to x86 AVX’s decades of optimization, RVV’s trajectory is strong. For edge AI, custom accelerators, and eventually general-purpose computing, RVV represents the future of portable high-performance vector processing.

The question isn’t if RISC-V vectors will be ubiquitous, but when.
