<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Syed Mannan Saood</title>
    <description>The latest articles on DEV Community by Syed Mannan Saood (@mannansaood_83).</description>
    <link>https://dev.to/mannansaood_83</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1113444%2F637d09fe-0f9c-42a8-ae9c-a6dd2a87c51e.png</url>
      <title>DEV Community: Syed Mannan Saood</title>
      <link>https://dev.to/mannansaood_83</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mannansaood_83"/>
    <language>en</language>
    <item>
      <title>RISC-V Vector Extension (RVV): SIMD for the Open ISA</title>
      <dc:creator>Syed Mannan Saood</dc:creator>
      <pubDate>Tue, 26 May 2026 09:18:19 +0000</pubDate>
      <link>https://dev.to/mannansaood_83/risc-v-vector-extension-rvv-simd-for-the-open-isa-3aon</link>
      <guid>https://dev.to/mannansaood_83/risc-v-vector-extension-rvv-simd-for-the-open-isa-3aon</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; RISC-V’s Vector Extension (RVV) brings length-agnostic SIMD to the open ISA. Unlike x86’s fixed-width SSE/AVX or ARM’s NEON, RVV is vector-length agnostic: software asks the hardware how many elements it will process per instruction, and the same instructions run on any physical register width. Code is therefore portable across implementations, from tiny embedded cores to large HPC machines, without recompilation. RVV 1.0 is ratified, is shipping in real silicon, and targets edge AI, HPC, and custom accelerators.&lt;/p&gt;




&lt;h2&gt;
  
  
  The SIMD Landscape Problem
&lt;/h2&gt;

&lt;p&gt;Modern processors need SIMD (Single Instruction Multiple Data) for performance. Processing one data element per instruction is too slow for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Image/video processing&lt;/li&gt;
&lt;li&gt;Machine learning inference&lt;/li&gt;
&lt;li&gt;Scientific computing&lt;/li&gt;
&lt;li&gt;Signal processing&lt;/li&gt;
&lt;li&gt;Compression/encryption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every major architecture has SIMD extensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;x86:&lt;/strong&gt; SSE → AVX → AVX-512 (128-bit → 256-bit → 512-bit)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ARM:&lt;/strong&gt; NEON (128-bit) → SVE/SVE2 (variable, 128-2048 bits)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RISC-V:&lt;/strong&gt; RVV (variable, vector-length agnostic; the register width is chosen by each implementation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But there’s a fundamental problem with how x86 and early ARM approached this.&lt;/p&gt;




&lt;h2&gt;
  
  
  The x86 SIMD Evolution Disaster
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Compatibility Nightmare
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;x86’s SIMD history:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1999: SSE (128-bit, 4 × FP32)
      __m128 vec = _mm_add_ps(a, b);

2011: AVX (256-bit, 8 × FP32)  
      __m256 vec = _mm256_add_ps(a, b);  // New instruction!

2017: AVX-512 (512-bit, 16 × FP32)
      __m512 vec = _mm512_add_ps(a, b);  // Yet another instruction!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Each generation requires &lt;strong&gt;completely new instructions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code compiled for AVX-512:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;process_avx512&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;__m512&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_loadu_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
        &lt;span class="n"&gt;vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_mul_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;_mm512_storeu_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



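&lt;p&gt;&lt;strong&gt;Won’t run on AVX2-only processors:&lt;/strong&gt; the 512-bit instructions raise an illegal-instruction fault there. Different width = different code. Note that the loop above also assumes n is a multiple of 16; real code needs a scalar tail loop or a masked final iteration.&lt;/p&gt;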

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Libraries ship multiple code paths (SSE, AVX, AVX-512)&lt;/li&gt;
&lt;li&gt;Runtime detection needed (CPUID checks)&lt;/li&gt;
&lt;li&gt;Binary bloat (each dispatched kernel ships in 3-4 versions)&lt;/li&gt;
&lt;li&gt;Maintenance nightmare&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Production example (FFmpeg):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Simplified FFmpeg-style dispatch; the ff_process_* names are illustrative&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cpu_flags&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;AV_CPU_FLAG_AVX512&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ff_process_avx512&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cpu_flags&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;AV_CPU_FLAG_AVX2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ff_process_avx2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cpu_flags&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;AV_CPU_FLAG_SSE4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ff_process_sse4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;ff_process_scalar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every hot function exists in four versions, and each one has to be written, tested, and maintained separately.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Market Fragmentation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;x86 processors in 2025:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low-power laptops: 128-bit SIMD only&lt;/li&gt;
&lt;li&gt;Desktop CPUs: 256-bit AVX2&lt;/li&gt;
&lt;li&gt;High-end servers: 512-bit AVX-512&lt;/li&gt;
&lt;li&gt;Some servers: AVX-512 disabled (heat/cost)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Your optimized AVX-512 code?&lt;/strong&gt; It runs on only a fraction of the x86 CPUs in the field.&lt;/p&gt;




&lt;h2&gt;
  
  
  ARM SVE: The Right Idea, Complex Execution
&lt;/h2&gt;

&lt;p&gt;ARM learned from x86’s mistakes with &lt;strong&gt;Scalable Vector Extension (SVE)&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  SVE’s Variable-Length Model
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// SVE code - vector length agnostic!&lt;/span&gt;
&lt;span class="n"&gt;svfloat32_t&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svld1_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
&lt;span class="n"&gt;vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svmul_f32_z&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;svst1_f32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;vec&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key innovation:&lt;/strong&gt; Same code runs on 128-bit, 256-bit, 512-bit, or 2048-bit hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How:&lt;/strong&gt; Predication and variable-length registers.&lt;/p&gt;

&lt;h3&gt;
  
  
  But SVE Has Issues
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Complexity:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex predicate registers&lt;/li&gt;
&lt;li&gt;Steep learning curve&lt;/li&gt;
&lt;li&gt;Limited compiler support initially&lt;/li&gt;
&lt;li&gt;ARM-specific (vendor lock-in)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Adoption:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fujitsu A64FX (HPC): 512-bit SVE&lt;/li&gt;
&lt;li&gt;AWS Graviton3: 256-bit SVE&lt;/li&gt;
&lt;li&gt;Consumer ARM: Still mostly NEON&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Market fragmentation:&lt;/strong&gt; Different ARM vendors choose different widths.&lt;/p&gt;




&lt;h2&gt;
  
  
  RISC-V’s Solution: RVV
&lt;/h2&gt;

&lt;p&gt;RISC-V Vector Extension takes SVE’s length-agnostic concept and simplifies it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Philosophy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Write once, run anywhere—regardless of hardware vector width.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Software writes:     Hardware executes:
┌──────────────┐    ┌──────────────┐
│ vadd.vv v1,  │    │ 128-bit impl │
│   v2, v3     │ → │ 256-bit impl │
│              │    │ 512-bit impl │
└──────────────┘    │ 1024-bit impl│
                    └──────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;All execute the same binary.&lt;/strong&gt; No recompilation needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vector Register Model
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;32 vector registers:&lt;/strong&gt; v0-v31&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key concept:&lt;/strong&gt; Software works with a &lt;strong&gt;logical vector length&lt;/strong&gt; (VL) that is independent of the physical register width (VLEN).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Logical view (programmer sees):
v1 = [0, 1, 2, 3, ..., VL-1]  (VL = vector length)

Physical implementations (LMUL = 1):
VLEN = 128 bits: up to 4 FP32 elements per register
VLEN = 256 bits: up to 8 FP32 elements per register
VLEN = 512 bits: up to 16 FP32 elements per register
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Same instruction, different throughput.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Application Vector Length (AVL)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The key abstraction:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Request to process 100 elements
li a0, 100           # Application vector length (AVL)
vsetvli t0, a0, e32, m1, ta, ma  # Set vector length: 32-bit elements, LMUL = 1

# t0 now contains actual VL (hardware-dependent)
# On 128-bit: VL = 4 (processes 4 × FP32)
# On 512-bit: VL = 16 (processes 16 × FP32)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Loop automatically adapts:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;process_loop:
    vsetvli t0, a0, e32, m1, ta, ma  # Get VL for remaining elements
    vle32.v v1, (a1)        # Load VL elements
    vadd.vv v1, v1, v2      # Add VL elements
    vse32.v v1, (a1)        # Store VL elements

    sub a0, a0, t0          # Remaining -= VL
    slli t1, t0, 2          # Advance pointer by VL*4 bytes
    add a1, a1, t1
    bnez a0, process_loop   # Loop if elements remain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



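&lt;p&gt;&lt;strong&gt;Beautiful:&lt;/strong&gt; Same code works on any vector width. On each pass the hardware reports how many elements it will process (VL), and the final pass simply receives a smaller VL, so no scalar tail loop is needed.&lt;/p&gt;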
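
&lt;p&gt;The same strip-mining idea can be sketched in plain C. This is a simplified model rather than RVV code: the function name is made up, the vlmax parameter stands in for the hardware-dependent VLMAX, the vector v2 is modelled as a single broadcast value, and the hardware is assumed to grant VL = min(remaining, VLMAX), which the specification allows but does not strictly require.&lt;/p&gt;

```c
/* Plain-C model of the strip-mined loop (hypothetical helper, not RVV code).
 * vlmax stands in for the hardware VLMAX and must be nonzero.
 * Returns the number of passes the loop needed. */
unsigned add_strip_mined(float *data, float addend, unsigned n, unsigned vlmax) {
    unsigned passes = (n + vlmax - 1u) / vlmax;     /* ceil(n / vlmax) */
    for (unsigned pass = 0u; pass != passes; pass++) {
        unsigned base = pass * vlmax;
        /* vsetvli: every pass gets VLMAX except the last, which gets the rest. */
        unsigned vl = (pass + 1u == passes) ? n - base : vlmax;
        for (unsigned i = 0u; i != vl; i++) {       /* one vector op covers vl lanes */
            data[base + i] += addend;
        }
    }
    return passes;
}
```

&lt;p&gt;With n = 10, a VLMAX of 4 takes three passes (4, 4, then 2 elements) and a VLMAX of 16 takes one pass; the resulting array is identical in both cases.&lt;/p&gt;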




&lt;h2&gt;
  
  
  RVV Architecture Deep-Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Vector Configuration (vsetvl)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Three parameters control vector execution:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;vsetvli&lt;/span&gt; &lt;span class="n"&gt;rd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rs1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vtypei&lt;/span&gt;

&lt;span class="n"&gt;rd&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;Destination&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;receives&lt;/span&gt; &lt;span class="n"&gt;actual&lt;/span&gt; &lt;span class="n"&gt;VL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rs1&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Application&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AVL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;vtypei&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Vector&lt;/span&gt; &lt;span class="n"&gt;type&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;element&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LMUL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;vtypei encoding:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vtype bits (MSB to LSB): [vma (7) | vta (6) | vsew (5:3) | vlmul (2:0)]

vsew: Element width
  e8:  8-bit elements
  e16: 16-bit elements
  e32: 32-bit elements
  e64: 64-bit elements

vlmul: Logical register grouping
  m1: Use 1 register
  m2: Use 2 registers as one (2× capacity)
  m4: Use 4 registers
  m8: Use 8 registers

vta: Tail agnostic (don't care about tail elements)
vma: Mask agnostic (don't care about masked elements)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vsetvli t0, a0, e32, m1, ta, ma
#           │   │    │   │   └─ Mask agnostic
#           │   │    │   └───── Tail agnostic
#           │   │    └───────── LMUL = 1 register
#           │   └────────────── Element size = 32 bits
#           └────────────────── AVL from a0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
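
&lt;p&gt;To make the encoding concrete, here is a small C sketch that packs the four fields into a vtype value. The field codes follow the RVV 1.0 layout (vlmul in bits 2:0, vsew in bits 5:3, vta in bit 6, vma in bit 7); the enum and function names are invented for this illustration.&lt;/p&gt;

```c
/* Field codes from the RVV 1.0 vtype layout (names are illustrative). */
enum { SEW_E8 = 0, SEW_E16 = 1, SEW_E32 = 2, SEW_E64 = 3 };
enum { LMUL_M1 = 0, LMUL_M2 = 1, LMUL_M4 = 2, LMUL_M8 = 3,
       LMUL_MF8 = 5, LMUL_MF4 = 6, LMUL_MF2 = 7 };

/* Pack the fields: vma at bit 7, vta at bit 6, vsew at bits 5:3, vlmul at bits 2:0.
 * Multiplying by 128, 64 and 8 is the same as shifting left by 7, 6 and 3. */
unsigned pack_vtype(unsigned vlmul, unsigned vsew, unsigned vta, unsigned vma) {
    return vma * 128u + vta * 64u + vsew * 8u + vlmul;
}
```

&lt;p&gt;Under this model, the setting e32, m1, ta, ma packs to 0xD0, and e8, mf2, ta, ma packs to 0xC7.&lt;/p&gt;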



&lt;h3&gt;
  
  
  LMUL: Register Grouping
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Processing wide data types or increasing throughput.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt; Group registers together.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LMUL=1 (m1):
v1 = single register

LMUL=2 (m2):  
v2 = {v2, v3} grouped as one logical register (2× capacity)

LMUL=4 (m4):
v4 = {v4, v5, v6, v7} (4× capacity)

LMUL=8 (m8):
v8 = {v8, v9, ..., v15} (8× capacity)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Process 64-bit doubles, need more capacity
vsetvli t0, a0, e64, m2, ta, ma  # Use register pairs
vle64.v v2, (a1)                  # Loads into v2+v3
vfmul.vv v2, v2, v4               # Multiply (v2,v3) × (v4,v5)
vse64.v v2, (a1)                  # Store from v2+v3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; More capacity, fewer independent vectors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fractional LMUL
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;For small element widths:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LMUL=1/2 (mf2): Use half a register
LMUL=1/4 (mf4): Use quarter register  
LMUL=1/8 (mf8): Use eighth register
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use case:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Process 8-bit pixels efficiently
vsetvli t0, a0, e8, mf2, ta, ma  # 8-bit elements, half register
vle8.v v1, (a1)                   # Load pixels
vadd.vi v1, v1, 5                 # Add constant
vse8.v v1, (a1)                   # Store
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



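&lt;p&gt;&lt;strong&gt;Benefit:&lt;/strong&gt; Narrow elements occupy only part of a register. In mixed-width code, 8-bit inputs at LMUL=1/2 pair with 16-bit results at LMUL=1, so the wide values fit in single registers instead of tying up register groups.&lt;/p&gt;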
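
&lt;p&gt;The capacity trade-off for both integer and fractional LMUL follows from VLMAX = LMUL × VLEN / SEW. The sketch below works through that arithmetic; the helper names are hypothetical, and LMUL is passed as a numerator and denominator so that mf2 becomes 1/2.&lt;/p&gt;

```c
/* VLMAX = LMUL * VLEN / SEW, with LMUL given as a fraction so that
 * mf2 is 1/2, m4 is 4/1, and so on (hypothetical helper). */
unsigned vlmax_for(unsigned vlen_bits, unsigned sew_bits,
                   unsigned lmul_num, unsigned lmul_den) {
    return (vlen_bits * lmul_num) / (sew_bits * lmul_den);
}

/* With an integer LMUL, the 32 architectural registers form 32 / LMUL groups. */
unsigned register_groups(unsigned lmul) {
    return 32u / lmul;
}
```

&lt;p&gt;On a 128-bit implementation, e64 with m2 gives VLMAX = 4 but leaves only 16 register groups, while e8 with mf2 gives VLMAX = 8.&lt;/p&gt;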




&lt;h2&gt;
  
  
  Vector Instruction Categories
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Configuration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vsetvli rd, rs1, vtypei    # Set VL by AVL
vsetivli rd, uimm, vtypei  # Set VL by immediate
vsetvl rd, rs1, rs2        # Set VL, type from register
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Load/Store
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Unit-stride (contiguous):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vle32.v v1, (a0)     # Load 32-bit elements
vse32.v v1, (a0)     # Store 32-bit elements
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Strided (fixed stride):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vlse32.v v1, (a0), a1  # Load with stride a1
vsse32.v v1, (a0), a1  # Store with stride a1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Indexed (gather/scatter):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vluxei32.v v1, (a0), v2  # Gather: load from a0 + byte offsets in v2 (unordered)
vsuxei32.v v1, (a0), v2  # Scatter: store to a0 + byte offsets in v2 (unordered)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Segment (array-of-structures in memory, split into one register per field):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vlseg3e32.v v1, (a0)  # Load 3-element structures
                      # v1 = {x0, x1, x2, ...}
                      # v2 = {y0, y1, y2, ...}
                      # v3 = {z0, z1, z2, ...}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
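
&lt;p&gt;In scalar terms, a three-field segment load de-interleaves packed structures into one vector per field. The following C model illustrates the data movement only; the function name is made up, and the three output arrays stand in for the registers v1, v2, and v3.&lt;/p&gt;

```c
/* Model of vlseg3e32.v: memory holds {x, y, z} triples back to back,
 * and each field lands in its own destination vector (illustrative helper). */
void deinterleave3(const float *mem, unsigned vl, float *x, float *y, float *z) {
    for (unsigned i = 0u; i != vl; i++) {
        x[i] = mem[3u * i + 0u];   /* field 0 of structure i */
        y[i] = mem[3u * i + 1u];   /* field 1 of structure i */
        z[i] = mem[3u * i + 2u];   /* field 2 of structure i */
    }
}
```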



&lt;h3&gt;
  
  
  3. Arithmetic
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Integer:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vadd.vv v1, v2, v3     # Vector + vector
vadd.vx v1, v2, a0     # Vector + scalar
vadd.vi v1, v2, 5      # Vector + immediate
vsub.vv v1, v2, v3     # Subtract
vmul.vv v1, v2, v3     # Multiply
vdiv.vv v1, v2, v3     # Divide
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Floating-point:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vfadd.vv v1, v2, v3    # FP add
vfmul.vv v1, v2, v3    # FP multiply
vfmadd.vv v1, v2, v3   # FP fused multiply-add: v1 = v1*v2 + v3
vfmacc.vv v1, v2, v3   # FP fused multiply-accumulate: v1 = v1 + v2*v3
vfdiv.vv v1, v2, v3    # FP divide
vfsqrt.v v1, v2        # FP square root
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Widening operations:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vwmul.vv v4, v1, v3    # Widening multiply: SEW → 2×SEW (e.g. e32 → e64)
                       # v1, v3 hold 32-bit elements (vtype set to e32)
                       # Result is 64-bit and fills the pair v4-v5 at LMUL=1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
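
&lt;p&gt;The point of a widening multiply is that the product is computed at twice the input width, so it cannot overflow the way a same-width multiply can. A scalar C model of that behaviour is sketched below; the function name is hypothetical and the code assumes int is 32 bits wide, as it is on the common RISC-V ABIs.&lt;/p&gt;

```c
/* Model of vwmul.vv: signed 32-bit inputs, 64-bit products.
 * Assumes int is 32 bits wide (illustrative helper). */
void widening_mul(const int *a, const int *b, long long *wide, unsigned vl) {
    for (unsigned i = 0u; i != vl; i++) {
        wide[i] = (long long)a[i] * (long long)b[i];   /* widen before multiplying */
    }
}
```

&lt;p&gt;For example, 100000 × 100000 does not fit in 32 bits, but the widened result 10000000000 is exact.&lt;/p&gt;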



&lt;h3&gt;
  
  
  4. Logical/Shift
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vand.vv v1, v2, v3     # Bitwise AND
vor.vv v1, v2, v3      # Bitwise OR
vxor.vv v1, v2, v3     # Bitwise XOR
vsll.vv v1, v2, v3     # Shift left logical
vsra.vv v1, v2, v3     # Shift right arithmetic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Comparison &amp;amp; Masking
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vmseq.vv v0, v1, v2    # Set mask: v1 == v2
vmslt.vv v0, v1, v2    # Set mask: v1 &amp;lt; v2
vmsle.vv v0, v1, v2    # Set mask: v1 &amp;lt;= v2

# Use mask in operations
vadd.vv v3, v1, v2, v0.t  # Add only where mask is true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
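
&lt;p&gt;Per element, a compare followed by a masked operation behaves like an if statement inside a loop. The C sketch below models that pairing. The function name is invented, and it assumes the mask-undisturbed policy, where masked-off elements keep their old value; under the mask-agnostic setting the hardware is allowed to overwrite them.&lt;/p&gt;

```c
/* Model of vmseq.vv followed by a masked vadd.vv (illustrative helper).
 * Assumption: masked-off lanes keep their old value (mask-undisturbed). */
void add_where_equal(const int *v1, const int *v2, int *v3, unsigned vl) {
    for (unsigned i = 0u; i != vl; i++) {
        int mask_bit = (v1[i] == v2[i]);   /* vmseq.vv v0, v1, v2 */
        if (mask_bit) {                    /* vadd.vv v3, v1, v2, v0.t */
            v3[i] = v1[i] + v2[i];
        }
    }
}
```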



&lt;h3&gt;
  
  
  6. Permutations
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vslideup.vi v1, v2, 5   # Slide up by 5 positions
vslidedown.vi v1, v2, 3 # Slide down by 3 positions
vrgather.vv v1, v2, v3  # Gather elements by index
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7. Reductions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;vredsum.vs v3, v1, v2   # Sum reduction
                        # v3[0] = v2[0] + sum(v1)
vredmax.vs v3, v1, v2   # Max reduction
vredmin.vs v3, v1, v2   # Min reduction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
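
&lt;p&gt;The operand roles in a reduction are easy to mix up, so here is a scalar C model of vredsum.vs vd, vs2, vs1. The function name is made up; the seed parameter plays the role of element 0 of vs1 and the return value is what lands in element 0 of vd.&lt;/p&gt;

```c
/* Model of vredsum.vs vd, vs2, vs1: the scalar seed comes from element 0
 * of vs1 and the result is written to element 0 of vd (illustrative helper). */
int reduce_sum(const int *vs2, unsigned vl, int seed) {
    int acc = seed;                  /* vs1[0] */
    for (unsigned i = 0u; i != vl; i++) {
        acc += vs2[i];               /* fold in each active element */
    }
    return acc;                      /* becomes vd[0] */
}
```

&lt;p&gt;Seeding from a register lets a loop carry a running total across strips: the result of one reduction can be passed as the seed of the next.&lt;/p&gt;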






&lt;h2&gt;
  
  
  Code Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Example 1: SAXPY (y = a*x + y)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;C code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;saxpy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RISC-V RVV assembly:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;saxpy:                                    # Args: fa0 = a, a0 = x, a1 = y, a2 = n
loop:
    vsetvli t0, a2, e32, m1, ta, ma      # VL = min(remaining, VLMAX)
    vle32.v v0, (a0)                      # Load x[i:i+VL]
    vle32.v v1, (a1)                      # Load y[i:i+VL]
    vfmacc.vf v1, fa0, v0                 # v1 = v1 + a * v0
    vse32.v v1, (a1)                      # Store y[i:i+VL]

    sub a2, a2, t0                        # Remaining -= VL
    slli t1, t0, 2                        # Offset = VL * 4 bytes
    add a0, a0, t1                        # x += offset
    add a1, a1, t1                        # y += offset
    bnez a2, loop                         # Loop if remaining &amp;gt; 0

    ret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Portable:&lt;/strong&gt; Works on 128-bit, 256-bit, 512-bit, 1024-bit implementations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example 2: Dot Product
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;C code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="nf"&gt;dot_product&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RVV assembly:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dot_product:                              # Args: a0 = a, a1 = b, a2 = n
    vsetvli t0, zero, e32, m1, ta, ma     # rs1 = zero with a real rd: VL = VLMAX
    vmv.v.i v2, 0                         # v2 = per-lane partial sums = 0

loop:
    vsetvli t0, a2, e32, m1, tu, ma       # tu: lanes past VL keep their partial sums
    vle32.v v0, (a0)                      # Load a[i:i+VL]
    vle32.v v1, (a1)                      # Load b[i:i+VL]
    vfmacc.vv v2, v0, v1                  # v2 += v0 * v1

    sub a2, a2, t0
    slli t1, t0, 2
    add a0, a0, t1
    add a1, a1, t1
    bnez a2, loop

    # Reduce v2 to scalar
    vsetvli t0, zero, e32, m1, ta, ma     # VL = VLMAX again so every lane is summed
    vmv.s.x v3, zero                      # v3[0] = 0.0 (all-zero bit pattern)
    vfredusum.vs v3, v2, v3               # v3[0] = v3[0] + sum(v2)
    vfmv.f.s fa0, v3                      # Return in fa0

    ret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example 3: RGB to Grayscale
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;C code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;rgb_to_gray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;rgb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;pixels&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kt"&gt;uint8_t&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rgb&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="kt"&gt;uint8_t&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rgb&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="kt"&gt;uint8_t&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rgb&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="n"&gt;gray&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;77&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RVV assembly (simplified):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rgb_to_gray:
    li t3, 77                  # .vx operands must be in scalar registers
    li t4, 150
    li t5, 29

loop:
    vsetvli t0, a2, e8, m1, ta, ma
    vlseg3e8.v v0, (a0)       # Load R,G,B into v0,v1,v2
                               # v0 = {r0, r1, r2, ...}
                               # v1 = {g0, g1, g2, ...}
                               # v2 = {b0, b1, b2, ...}

    # Widen to 16-bit for multiplication (result occupies v4-v5)
    vwmulu.vx v4, v0, t3      # v4 = r * 77 (16-bit)
    vwmaccu.vx v4, t4, v1     # v4 += g * 150
    vwmaccu.vx v4, t5, v2     # v4 += b * 29

    # Shift right by 8, narrow to 8-bit
    vnsrl.wi v3, v4, 8        # v3 = v4 &amp;gt;&amp;gt; 8 (narrow to 8-bit)

    vse8.v v3, (a1)           # Store grayscale

    sub a2, a2, t0
    li t1, 3
    mul t2, t0, t1            # RGB offset = VL * 3
    add a0, a0, t2
    add a1, a1, t0
    bnez a2, loop

    ret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
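
&lt;p&gt;Widening to 16 bits is enough because the three weights sum to 256, so the largest possible accumulator value is 255 × 256 = 65280. The sketch below models one lane of the vector code in scalar C and computes that bound. The names &lt;code&gt;gray_from_rgb&lt;/code&gt; and &lt;code&gt;max_accumulator&lt;/code&gt; are illustrative, not part of any library.&lt;/p&gt;

```c
#include <stdint.h>

/* Fixed-point luma weights from the example above; they sum to 256. */
enum { W_R = 77, W_G = 150, W_B = 29 };

/* Scalar model of one lane: widen to 16 bits, accumulate, shift, narrow. */
uint8_t gray_from_rgb(uint8_t r, uint8_t g, uint8_t b) {
    uint16_t acc = (uint16_t)(r * W_R);   /* vwmulu.vx  */
    acc = (uint16_t)(acc + g * W_G);      /* vwmaccu.vx */
    acc = (uint16_t)(acc + b * W_B);      /* vwmaccu.vx */
    return (uint8_t)(acc >> 8);           /* vnsrl.wi   */
}

/* Largest possible accumulator value: all three channels at 255. */
uint32_t max_accumulator(void) {
    return 255u * (W_R + W_G + W_B);
}
```

&lt;p&gt;A scalar model like this is also a convenient reference when checking vector output pixel by pixel under an emulator.&lt;/p&gt;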






&lt;h2&gt;
  
  
  Compiler Support
&lt;/h2&gt;

&lt;h3&gt;
  
  
  GCC Intrinsics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;RVV intrinsics follow a pattern:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;riscv_vector.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="c1"&gt;// Naming: __riscv_v&amp;lt;op&amp;gt;_&amp;lt;operands&amp;gt;_&amp;lt;type&amp;gt;&amp;lt;lmul&amp;gt;&lt;/span&gt;
&lt;span class="n"&gt;vfloat32m1_t&lt;/span&gt; &lt;span class="nf"&gt;__riscv_vfadd_vv_f32m1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vfloat32m1_t&lt;/span&gt; &lt;span class="n"&gt;vs2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                    &lt;span class="n"&gt;vfloat32m1_t&lt;/span&gt; &lt;span class="n"&gt;vs1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;vl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example: SAXPY&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;saxpy_rvv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;vl&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;vl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;vl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;__riscv_vsetvl_e32m1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Set VL&lt;/span&gt;
        &lt;span class="n"&gt;vfloat32m1_t&lt;/span&gt; &lt;span class="n"&gt;vx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;__riscv_vle32_v_f32m1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Load x&lt;/span&gt;
        &lt;span class="n"&gt;vfloat32m1_t&lt;/span&gt; &lt;span class="n"&gt;vy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;__riscv_vle32_v_f32m1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Load y&lt;/span&gt;
        &lt;span class="n"&gt;vy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;__riscv_vfmacc_vf_f32m1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;          &lt;span class="c1"&gt;// y += a*x&lt;/span&gt;
        &lt;span class="n"&gt;__riscv_vse32_v_f32m1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vy&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;                  &lt;span class="c1"&gt;// Store y&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Auto-Vectorization
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Modern compilers can auto-vectorize:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;add_arrays&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GCC with &lt;code&gt;-march=rv64gcv -O3&lt;/code&gt;:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The loop is compiled to RVV instructions (vsetvli, vle32.v, vfadd.vv, vse32.v) with no source changes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Works best with:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple counted loops&lt;/li&gt;
&lt;li&gt;No loop-carried dependencies&lt;/li&gt;
&lt;li&gt;Naturally aligned data&lt;/li&gt;
&lt;li&gt;Pragmas as hints where the compiler is too conservative&lt;/li&gt;
&lt;/ul&gt;
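
&lt;p&gt;One hint that often matters is pointer aliasing. In &lt;code&gt;add_arrays&lt;/code&gt; the compiler cannot prove that &lt;code&gt;c&lt;/code&gt; does not overlap &lt;code&gt;a&lt;/code&gt; or &lt;code&gt;b&lt;/code&gt;, so it has to emit a runtime overlap check or stay scalar. A minimal sketch using the C99 &lt;code&gt;restrict&lt;/code&gt; qualifier (the name &lt;code&gt;add_arrays_restrict&lt;/code&gt; is illustrative):&lt;/p&gt;

```c
#include <stddef.h>

/* restrict promises the three arrays do not overlap, so no alias check is needed. */
void add_arrays_restrict(const float *restrict a, const float *restrict b,
                         float *restrict c, size_t n) {
    for (size_t i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}
```

&lt;p&gt;With GCC, &lt;code&gt;-fopt-info-vec&lt;/code&gt; reports which loops were vectorized, which is a quick way to confirm that a hint had the intended effect.&lt;/p&gt;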




&lt;h2&gt;
  
  
  Performance Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Theoretical Speedup
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Scalar code (1 FP32/cycle):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1000 elements → 1000 cycles
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;128-bit RVV (4 FP32/cycle):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1000 elements → 250 cycles (4× speedup)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;256-bit RVV (8 FP32/cycle):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1000 elements → 125 cycles (8× speedup)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;512-bit RVV (16 FP32/cycle):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1000 elements → 63 cycles (16× speedup)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Same binary.&lt;/strong&gt; Different hardware, different throughput.&lt;/p&gt;
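
&lt;p&gt;The cycle counts above come from a simple idealized model: one full-width FP32 vector operation retires per cycle, and memory bandwidth, loop overhead, and startup latency are ignored. A sketch of that arithmetic, with &lt;code&gt;ideal_cycles&lt;/code&gt; as an illustrative helper name:&lt;/p&gt;

```c
#include <stddef.h>

/* Idealized cycle count: one full-width FP32 vector operation per cycle. */
size_t ideal_cycles(size_t elements, size_t vector_bits) {
    size_t lanes = vector_bits / 32;            /* FP32 lanes per operation */
    return (elements + lanes - 1) / lanes;      /* round up for the tail    */
}
```

&lt;p&gt;Passing 32 for &lt;code&gt;vector_bits&lt;/code&gt; models the scalar case. Real kernels usually land below these numbers once loads and stores compete for memory bandwidth.&lt;/p&gt;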

&lt;h3&gt;
  
  
  Real-World Benchmarks
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Matrix multiplication (GEMM):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;th&gt;Performance (GFLOPS)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scalar C&lt;/td&gt;
&lt;td&gt;0.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RVV (128-bit)&lt;/td&gt;
&lt;td&gt;3.2 (4× speedup)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RVV (256-bit)&lt;/td&gt;
&lt;td&gt;6.4 (8× speedup)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RVV (512-bit)&lt;/td&gt;
&lt;td&gt;12.8 (16× speedup)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Image convolution:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Filter Size&lt;/th&gt;
&lt;th&gt;Scalar&lt;/th&gt;
&lt;th&gt;RVV 128-bit&lt;/th&gt;
&lt;th&gt;RVV 256-bit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;3×3&lt;/td&gt;
&lt;td&gt;45ms&lt;/td&gt;
&lt;td&gt;12ms (3.75×)&lt;/td&gt;
&lt;td&gt;6ms (7.5×)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5×5&lt;/td&gt;
&lt;td&gt;120ms&lt;/td&gt;
&lt;td&gt;32ms (3.75×)&lt;/td&gt;
&lt;td&gt;16ms (7.5×)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Close to theoretical speedup&lt;/strong&gt; with good algorithm design.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hardware Implementations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Commercial Silicon (2025)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Alibaba T-Head:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;XuanTie C910: 128-bit RVV 0.7.1&lt;/li&gt;
&lt;li&gt;XuanTie C920: 128-bit RVV 0.7.1 (the later C920v2 moves to RVV 1.0)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SiFive:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;P670: 256-bit RVV 1.0&lt;/li&gt;
&lt;li&gt;X280: 512-bit RVV 1.0 (HPC-focused)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Andes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AX65: 128-bit RVV 1.0&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SpacemiT:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;K1: 256-bit RVV 1.0 (8-core, consumer SBC)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  VLEN (Vector Register Length)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Common implementations:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;VLEN&lt;/th&gt;
&lt;th&gt;FP32 Elements&lt;/th&gt;
&lt;th&gt;Target Market&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;128-bit&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Embedded, IoT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256-bit&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;General purpose, edge AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512-bit&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;HPC, servers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024-bit&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;Supercomputing&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;All run the same binaries&lt;/strong&gt;, provided the code is written length-agnostically.&lt;/p&gt;
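
&lt;p&gt;The element counts in the table follow from the RVV relation VLMAX = LMUL × VLEN / SEW, here with SEW = 32 and LMUL = 1. A small sketch of that formula; &lt;code&gt;vlmax&lt;/code&gt; is an illustrative helper, and it models integer LMUL values only:&lt;/p&gt;

```c
/* VLMAX = LMUL * VLEN / SEW, for integer LMUL (1, 2, 4, 8). */
unsigned vlmax(unsigned vlen_bits, unsigned sew_bits, unsigned lmul) {
    return lmul * vlen_bits / sew_bits;
}
```

&lt;p&gt;Raising LMUL groups registers together, so a 128-bit implementation at LMUL = 8 handles as many FP32 elements per instruction as a 1024-bit one at LMUL = 1, at the cost of fewer usable register groups.&lt;/p&gt;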




&lt;h2&gt;
  
  
  RVV vs ARM SVE vs x86 AVX
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Code Portability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;RVV:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// One code path, works on all VLEN&lt;/span&gt;
&lt;span class="n"&gt;vfloat32m1_t&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;__riscv_vfadd_vv_f32m1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ARM SVE:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// One code path, works on all SVE lengths&lt;/span&gt;
&lt;span class="n"&gt;svfloat32_t&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svadd_f32_z&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;x86 AVX:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Different code per width&lt;/span&gt;
&lt;span class="cp"&gt;#ifdef __AVX512F__
&lt;/span&gt;    &lt;span class="n"&gt;__m512&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_add_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// 512-bit&lt;/span&gt;
&lt;span class="cp"&gt;#elif defined(__AVX__)
&lt;/span&gt;    &lt;span class="n"&gt;__m256&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_add_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// 256-bit&lt;/span&gt;
&lt;span class="cp"&gt;#else
&lt;/span&gt;    &lt;span class="n"&gt;__m128&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_add_ps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;     &lt;span class="c1"&gt;// 128-bit&lt;/span&gt;
&lt;span class="cp"&gt;#endif
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Winner:&lt;/strong&gt; RVV and SVE (length-agnostic)&lt;/p&gt;
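
&lt;p&gt;What length-agnostic means in practice is that the loop asks the hardware how many elements it may process on each pass instead of hard-coding a width. The portable C model below illustrates this. It is a simplification: &lt;code&gt;model_vsetvl&lt;/code&gt; returns min(AVL, VLMAX), which is one behaviour the specification permits, and the &lt;code&gt;vlmax&lt;/code&gt; parameter stands in for the width of the underlying hardware. Both function names are illustrative.&lt;/p&gt;

```c
#include <stddef.h>

/* Simplified model of vsetvl: process at most vlmax elements per pass. */
static size_t model_vsetvl(size_t avl, size_t vlmax) {
    return avl < vlmax ? avl : vlmax;
}

/* Adds a and b into c; returns the number of passes the loop needed. */
size_t strip_mined_add(const float *a, const float *b, float *c,
                       size_t n, size_t vlmax) {
    size_t passes = 0;
    for (size_t i = 0; i < n;) {
        size_t vl = model_vsetvl(n - i, vlmax);
        for (size_t lane = 0; lane < vl; lane++) {   /* one "vector" operation */
            c[i + lane] = a[i + lane] + b[i + lane];
        }
        i += vl;
        passes++;
    }
    return passes;
}
```

&lt;p&gt;The same source produces the same results for any &lt;code&gt;vlmax&lt;/code&gt;; only the number of passes changes, which is the property the fixed-width AVX code path lacks.&lt;/p&gt;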

&lt;h3&gt;
  
  
  Simplicity
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;RVV:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple mask model (single mask register v0)&lt;/li&gt;
&lt;li&gt;Straightforward vsetvl configuration&lt;/li&gt;
&lt;li&gt;32 vector registers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SVE:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex predicate registers (p0-p15)&lt;/li&gt;
&lt;li&gt;Governing predicates + first-fault loads&lt;/li&gt;
&lt;li&gt;32 vector registers + 16 predicates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;x86 AVX:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No length abstraction&lt;/li&gt;
&lt;li&gt;Different instruction sets per width&lt;/li&gt;
&lt;li&gt;Mask registers (AVX-512) add complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Winner:&lt;/strong&gt; RVV (simpler model)&lt;/p&gt;

&lt;h3&gt;
  
  
  Ecosystem
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;x86 AVX:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mature compiler support&lt;/li&gt;
&lt;li&gt;Extensive libraries&lt;/li&gt;
&lt;li&gt;Decades of optimization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ARM SVE:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Growing compiler support&lt;/li&gt;
&lt;li&gt;Arm-specific (vendor lock-in)&lt;/li&gt;
&lt;li&gt;Limited consumer hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;RVV:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compiler support improving rapidly&lt;/li&gt;
&lt;li&gt;Open standard (no vendor lock-in)&lt;/li&gt;
&lt;li&gt;Growing hardware ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Winner:&lt;/strong&gt; x86 (today), RVV (trajectory)&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Length-agnostic is the right model&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One binary, any vector width&lt;/li&gt;
&lt;li&gt;Future-proof code&lt;/li&gt;
&lt;li&gt;Hardware flexibility&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Simpler than ARM SVE&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Easier to learn and use&lt;/li&gt;
&lt;li&gt;Straightforward mask model&lt;/li&gt;
&lt;li&gt;Good compiler target&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Open standard advantage&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No vendor lock-in&lt;/li&gt;
&lt;li&gt;Custom extensions possible&lt;/li&gt;
&lt;li&gt;Growing ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Not a drop-in x86 replacement (yet)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ecosystem still maturing&lt;/li&gt;
&lt;li&gt;Limited consumer hardware&lt;/li&gt;
&lt;li&gt;But trajectory is strong&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;5. Ideal for specialized domains&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Edge AI (custom VLEN for models)&lt;/li&gt;
&lt;li&gt;HPC (large VLEN for throughput)&lt;/li&gt;
&lt;li&gt;Embedded (small VLEN for power)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Getting Started with RVV
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Emulation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;QEMU:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run under QEMU user-mode emulation with RVV enabled and VLEN=256&lt;/span&gt;
qemu-riscv64 &lt;span class="nt"&gt;-cpu&lt;/span&gt; rv64,v&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;,vlen&lt;span class="o"&gt;=&lt;/span&gt;256 ./my_rvv_program
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Spike (RISC-V ISA Simulator):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;spike &lt;span class="nt"&gt;--isa&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;rv64gcv pk ./my_rvv_program
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Development Boards
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;SpacemiT K1:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8-core RISC-V&lt;/li&gt;
&lt;li&gt;256-bit RVV 1.0&lt;/li&gt;
&lt;li&gt;Linux support&lt;/li&gt;
&lt;li&gt;~$100&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SiFive HiFive Unmatched:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;U74 cores (no RVV yet)&lt;/li&gt;
&lt;li&gt;Waiting for P670 upgrade&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cross-Compilation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GCC toolchain:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;riscv64-unknown-linux-gnu-gcc &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-march&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;rv64gcv &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-O3&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-o&lt;/span&gt; program &lt;span class="se"&gt;\&lt;/span&gt;
    program.c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Intrinsics example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;riscv_vector.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;vector_add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;float&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;vl&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;vl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;vl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;__riscv_vsetvl_e32m1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;vfloat32m1_t&lt;/span&gt; &lt;span class="n"&gt;va&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;__riscv_vle32_v_f32m1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;vl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;vfloat32m1_t&lt;/span&gt; &lt;span class="n"&gt;vb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;__riscv_vle32_v_f32m1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;vl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;vfloat32m1_t&lt;/span&gt; &lt;span class="n"&gt;vc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;__riscv_vfadd_vv_f32m1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;va&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__riscv_vse32_v_f32m1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;vc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vl&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;RISC-V Vector Extension brings length-agnostic SIMD to the open ISA ecosystem. By learning from x86’s fixed-width mistakes and ARM SVE’s complexity, RVV offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Portable code across any vector width&lt;/li&gt;
&lt;li&gt;Simpler programming model&lt;/li&gt;
&lt;li&gt;Open standard flexibility&lt;/li&gt;
&lt;li&gt;Growing hardware and software ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While still maturing compared to x86 AVX’s decades of optimization, RVV’s trajectory is strong. For edge AI, custom accelerators, and eventually general-purpose computing, RVV represents the future of portable high-performance vector processing.&lt;/p&gt;

&lt;p&gt;The question isn’t if RISC-V vectors will be ubiquitous, but when.&lt;/p&gt;




&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Specifications:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RISC-V Vector Extension 1.0 Specification&lt;/li&gt;
&lt;li&gt;RISC-V ISA Manual (Volume 2: Privileged)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Implementations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SiFive P670/X280 documentation&lt;/li&gt;
&lt;li&gt;Alibaba T-Head XuanTie documentation&lt;/li&gt;
&lt;li&gt;Andes AX65 documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GCC RISC-V Vector Intrinsics Guide&lt;/li&gt;
&lt;li&gt;LLVM RISC-V Backend Documentation&lt;/li&gt;
&lt;li&gt;QEMU RISC-V Emulation Guide&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Communities:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RISC-V International Vector SIG&lt;/li&gt;
&lt;li&gt;RISC-V Software mailing lists&lt;/li&gt;
&lt;li&gt;RISC-V Exchange forums&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Next in the series: vLLM’s PagedAttention - memory management for LLM serving&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Discussion:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What are your thoughts on RISC-V’s approach to vectors?&lt;br&gt;
Have you worked with ARM SVE or x86 AVX?&lt;br&gt;
What applications would benefit most from RVV?&lt;/p&gt;

&lt;p&gt;Share your thoughts in the comments.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>simd</category>
      <category>riscv</category>
      <category>vectorprocessing</category>
    </item>
  </channel>
</rss>
