<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Wunk</title>
    <description>The latest articles on DEV Community by Wunk (@wunk).</description>
    <link>https://dev.to/wunk</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F128866%2F7ab18ece-15a9-4647-b6cb-85514dad19ff.png</url>
      <title>DEV Community: Wunk</title>
      <link>https://dev.to/wunk</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/wunk"/>
    <language>en</language>
    <item>
      <title>Morton encoding with AVX512</title>
      <dc:creator>Wunk</dc:creator>
      <pubDate>Fri, 28 Jun 2019 16:25:04 +0000</pubDate>
      <link>https://dev.to/wunk/morton-encoding-with-avx512-4hgk</link>
      <guid>https://dev.to/wunk/morton-encoding-with-avx512-4hgk</guid>
      <description>&lt;p&gt;In anticipation of &lt;a href="https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512"&gt;Icelake bringing AVX512 to consumer hardware&lt;/a&gt;, I've come up with an alternative to the &lt;code&gt;pdep&lt;/code&gt; method for encoding a 32-bit 2D coordinate into a 64-bit morton code which utilizes the &lt;a href="https://en.wikipedia.org/wiki/AVX-512#New_instructions_in_AVX-512_VPOPCNTDQ_and_AVX-512_BITALG"&gt;&lt;code&gt;vpshufbitqmb&lt;/code&gt; instruction&lt;/a&gt; of the BITALG subset of AVX512.&lt;br&gt;
&lt;code&gt;vpshufbitqmb&lt;/code&gt; allows building a new value bit by bit by indexing into a 64-bit lane(into an AVX512 &lt;code&gt;k&lt;/code&gt;-register mask).&lt;/p&gt;

&lt;p&gt;It relies on interpreting two 32-bit X and Y coordinates as a continuous 64-bit value though. It can support much higher dimensions provided you python-&lt;code&gt;zip&lt;/code&gt;(as in, feed in and interleave the leading bytes or so of the coordinates vector elements into each 64-bit lane before selecting bits) certain bytes of the input coordinates.&lt;/p&gt;

&lt;p&gt;There is no throughput or latency data at the moment for this instruction or any hardware to actually test this upon but the assembly looks very promising, compared to the currently fastest &lt;code&gt;pdep&lt;/code&gt; method. Especially for any bulk-processing where all the constants are loaded outside of the loop.&lt;/p&gt;

&lt;p&gt;An illustrative example of the two methods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#pragma pack(push, 1)
&lt;/span&gt;&lt;span class="k"&gt;union&lt;/span&gt; &lt;span class="n"&gt;Coord2D&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="c1"&gt;// x, y&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="cp"&gt;#pragma pack(pop)
&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="nf"&gt;Morton2D_AVX512&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;Coord2D&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;coord&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__m512i&lt;/span&gt; &lt;span class="n"&gt;CoordVec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_set1_epi64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coord&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;u64&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Interleave bits from 32-bit X and Y coordinate&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__mmask64&lt;/span&gt; &lt;span class="n"&gt;Interleave&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_bitshuffle_epi64_mask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;CoordVec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;
        &lt;span class="n"&gt;_mm512_set_epi16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;31&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;29&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;23&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;21&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;18&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;  &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;  &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;  &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;  &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;  &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;  &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;  &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;  &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;  &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x20'00&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mh"&gt;0x01'01&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_cvtmask64_u64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Interleave&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="nf"&gt;Morton2D_BMI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;Coord2D&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;coord&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_pdep_u64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coord&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="mh"&gt;0x5555555555555555&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;_pdep_u64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;coord&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;u32&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt; &lt;span class="mh"&gt;0xAAAAAAAAAAAAAAAA&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The generated assembly using &lt;code&gt;gcc 9.1&lt;/code&gt; with &lt;code&gt;-Ofast -march=icelake-client&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Morton2D_AVX512(Coord2D const&amp;amp;):
        vpbroadcastq    zmm0, QWORD PTR [rdi]
        vpshufbitqmb    k0, zmm0, ZMMWORD PTR .LC0[rip]
        kmovq   rax, k0
        vzeroupper
        ret
Morton2D_BMI(Coord2D const&amp;amp;):
        mov     eax, DWORD PTR [rdi]
        movabs  rdx, 6148914691236517205
        pdep    rax, rax, rdx
        mov     edx, DWORD PTR [rdi+4]
        movabs  rcx, -6148914691236517206
        pdep    rdx, rdx, rcx
        or      rax, rdx
        ret
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I have verified the implementation with &lt;a href="https://software.intel.com/en-us/articles/intel-software-development-emulator"&gt;Intel's emulator&lt;/a&gt; and I'd love to update with some actual benchmarks once I get some Icelake hardware in my hands.&lt;/p&gt;

</description>
      <category>cpp</category>
      <category>c</category>
      <category>performance</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Fast base2 decoding on the upcoming Intel Icelake</title>
      <dc:creator>Wunk</dc:creator>
      <pubDate>Fri, 31 May 2019 04:42:22 +0000</pubDate>
      <link>https://dev.to/wunk/fast-base2-decoding-on-the-upcoming-intel-icelake-1mk1</link>
      <guid>https://dev.to/wunk/fast-base2-decoding-on-the-upcoming-intel-icelake-1mk1</guid>
      <description>&lt;p&gt;This is a little writeup on some anticipatory code to eventually test and benchmark on the upcoming Intel Icelake architecture.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;pext&lt;/code&gt; instruction is a particularly useful instruction in &lt;a href="https://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets"&gt;BMI2&lt;/a&gt; that allows the programmer to provide a bit-mask integer with &lt;code&gt;1&lt;/code&gt; bits set in positions of interests for which the &lt;code&gt;pext&lt;/code&gt; instruction will &lt;strong&gt;ext&lt;/strong&gt;ract these bits in &lt;strong&gt;p&lt;/strong&gt;arallel and compact them all against the least-significnat bits.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Given a bitmask and an input, pext will select the bits where-ever there is a
set bit in the mask, and compress them together to produce a new result.

|0000100000001111100000100001111100010000000010000001001000010000|  &amp;lt; Operand A
  ^      ^             ^       ^                ^      ^    ^   ^
|0100000010000000000000100000001000000000000000010000001000010001|  &amp;lt; Mask
                                   | Extract bits at mask
                                   V
|.1......0.............1.......1................0......1....1...0|
                                   | Compress into new 64-bit integer
                                   V
|0000000000000000000000000000000000000000000000000000000000110010|  &amp;lt; Result
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This instruction is very useful for tasks such as converting ascii &lt;a href="http://0x80.pl/notesen/2014-10-06-pext-convert-ascii-bin-to-num.html"&gt;base2&lt;/a&gt;&lt;br&gt;
and &lt;a href="http://0x80.pl/notesen/2014-10-09-pext-convert-ascii-hex-to-num.html"&gt;hexadecimal&lt;/a&gt; back into its original bytes of data and other uses in &lt;a href="http://0x80.pl/notesen/2019-01-05-avx512vbmi-remove-spaces.html#pext-based-algorithm-new"&gt;text-processing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pext&lt;/code&gt; only has a 32-bit and 64-bit variants to process 4 or 8 bytes of data&lt;br&gt;
at once within a general-purpose register so it is only capable of processing&lt;br&gt;
one base-2 byte of data(8 ascii-characters of &lt;code&gt;0&lt;/code&gt; and &lt;code&gt;1&lt;/code&gt;) back into a regular byte at a time(it also requires a &lt;code&gt;bswap64&lt;/code&gt; due to endian-issues):&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Goes from ascii "00100101" to binary byte 0b00100101('a')&lt;/span&gt;
&lt;span class="c1"&gt;// the ascii string "00100101" is 8 bytes, so it fits perfectly&lt;/span&gt;
&lt;span class="c1"&gt;// with a 64-bit integer&lt;/span&gt;
&lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt; &lt;span class="nf"&gt;DecodeBase2Word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;BinAscii&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;CurInput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;__builtin_bswap64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BinAscii&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt; &lt;span class="n"&gt;Binary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="cp"&gt;#if defined(__BMI2__)
&lt;/span&gt;    &lt;span class="c1"&gt;// Much faster, or is it?&lt;/span&gt;
    &lt;span class="n"&gt;Binary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_pext_u64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CurInput&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x0101010101010101UL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="cp"&gt;#else
&lt;/span&gt;    &lt;span class="c1"&gt;// Serial bit extraction&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;Mask&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mh"&gt;0x0101010101010101UL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;CurBit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1UL&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;Mask&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;CurBit&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;CurInput&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;Mask&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Mask&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;Binary&lt;/span&gt; &lt;span class="o"&gt;|=&lt;/span&gt; &lt;span class="n"&gt;CurBit&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;Mask&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Mask&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1UL&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="cp"&gt;#endif
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Binary&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;




&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512"&gt;Icelake&lt;/a&gt; introduces the &lt;code&gt;AVX512&lt;/code&gt; variant &lt;a href="https://en.wikipedia.org/wiki/AVX-512#New_instructions_in_AVX-512_VPOPCNTDQ_and_AVX-512_BITALG"&gt;BITALG&lt;/a&gt; which provides the intrinsic&lt;br&gt;
&lt;code&gt;_mm_bitshuffle_epi64_mask&lt;/code&gt; which emits &lt;a href="https://github.com/HJLebbink/asm-dude/wiki/VPSHUFBITQMB"&gt;vpshufbitqmb&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This instruction will go through each 64-bit lane in the second operand and&lt;br&gt;
treat it as eight 8-bit index values(modulo 64) into the bits of the other operand's 64-bit lane, then it will compact these 8 selected bits into a new 8-bit&lt;br&gt;
integer within the AVX512 mask-register.&lt;/p&gt;

&lt;p&gt;While this doesn't use a bitmask like &lt;code&gt;pext&lt;/code&gt;, it allow 64-bit integers&lt;br&gt;
to be treated as what is basically a Look Up Table of 64-bits to produce a new&lt;br&gt;
8-bit byte. Not only that, but it operates upon 64-bit lanes within SIMD registers allowing for up to 8 bytes to be converted at a time in parallel.&lt;br&gt;
2 bytes can be converted at a time in a 128-bit SSE register, 4 at a time in a&lt;br&gt;
256-bit AVX register, and 8 at a time in a 512-bit register.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  3210987654321098765432109876543210987654321098765432109876543210
  666655555555554444444444333333333322222222221111111111
  -----------------------------------------------------------------
  0000100000001111100000100001111100010000000010000001001000010000| &amp;lt; Operand A
  |^      ^             ^       ^     +----------^      ^    ^   ^|
  ||      |             |       |     |       +---------+    |+--+|
  |+--+   +---+       +-+     +-+     |       |       +------+|   |
  |   |       |       |       |       |       |       |       |   |
  |---+-------+-------+-------+-------+-------+-------+-------+---|
  |  62   |  55   |  41   |  33   |  16   |   9   |   4   |   0   | &amp;lt; Operand B
  +---------------------------------------------------------------+
                                  | Get bits at index
                                  V
  +---------------------------------------------------------------+
  |   0   |   0   |   1   |   1   |   0   |   0   |   1   |   0   |
  +---------------------------------------------------------------+
                                  | Compress into new 8-bit integer
                                  V
                          +----------------+
                          |   0b00110010   |
                          +----------------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;With this, a basic greedy algorithm can be made to process different widths of base2 ascii back into its original bytes. The least significant bit of the ascii bytes for &lt;code&gt;0&lt;/code&gt; and &lt;code&gt;1&lt;/code&gt; are also &lt;code&gt;0&lt;/code&gt; and &lt;code&gt;1&lt;/code&gt;. By extracting and compacting these bits together, we can convert ascii bytes back into their original bytes.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre class="highlight cpp"&gt;&lt;code&gt;
&lt;span class="c1"&gt;// '0' : 0b00110000&lt;/span&gt;
&lt;span class="c1"&gt;// '1' : 0b00110001&lt;/span&gt;
&lt;span class="c1"&gt;//                ^ Extract and compress these bits&lt;/span&gt;
&lt;span class="c1"&gt;//                  the rest of he bits stay the same! (0x30)&lt;/span&gt;
&lt;span class="c1"&gt;//                  (assuming you've validated your input)&lt;/span&gt;

&lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;Base2Decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt; &lt;span class="n"&gt;Output&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;Length&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// 8 at a time&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Length&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__mmask64&lt;/span&gt; &lt;span class="n"&gt;Compressed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_bitshuffle_epi64_mask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;_mm512_loadu_si512&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__m512i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Input&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
            &lt;span class="n"&gt;_mm512_set1_epi64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0x00'08'10'18'20'28'30'38&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;_store_mask64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__mmask64&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Output&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Compressed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// 4 at a time&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Length&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__mmask32&lt;/span&gt; &lt;span class="n"&gt;Compressed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_bitshuffle_epi64_mask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;_mm256_loadu_si256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__m256i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Input&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
            &lt;span class="n"&gt;_mm256_set1_epi64x&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0x00'08'10'18'20'28'30'38&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;_store_mask32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__mmask32&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Output&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Compressed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// 2 at a time&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Length&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__mmask16&lt;/span&gt; &lt;span class="n"&gt;Compressed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_bitshuffle_epi64_mask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;_mm_loadu_si128&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__m128i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Input&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
            &lt;span class="n"&gt;_mm_set1_epi64x&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0x00'08'10'18'20'28'30'38&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;_store_mask16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__mmask16&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Output&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;Compressed&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// Serial(could probably just use the pext implementation here but I'm demonstrating bitshuffle_epi64 here)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Length&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__mmask16&lt;/span&gt; &lt;span class="n"&gt;Compressed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_bitshuffle_epi64_mask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;_mm_loadl_epi64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__m128i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Input&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
            &lt;span class="n"&gt;_mm_set1_epi64x&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mh"&gt;0x00'08'10'18'20'28'30'38&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;Output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_cvtmask16_u32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Compressed&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// "Hello World!"&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Input&lt;/span&gt;
    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="s"&gt;"010010000110010101101100011011000110111100100000010101110110111101110010011011000110010000100001"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt; &lt;span class="n"&gt;Output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="n"&gt;Base2Decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;printf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Output: '%.12s'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Output&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;As of now(Thu 30 May 2019 01:53:47 PM PDT) Icelake is not out yet so there is no&lt;br&gt;
way for me to actually benchmark this on hardware to get some performance numbers&lt;br&gt;
but it is theretically up to &lt;strong&gt;x8&lt;/strong&gt; times faster than the &lt;em&gt;already-fast&lt;/em&gt; &lt;code&gt;pext&lt;/code&gt; method of decoding base2-ascii back into binary!&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vJ70wriM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://practicaldev-herokuapp-com.freetls.fastly.net/assets/github-logo-ba8488d21cd8ee1fee097b8410db9deaa41d0ca30b004c0c63de0a479114156f.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Wunkolo"&gt;
        Wunkolo
      &lt;/a&gt; / &lt;a href="https://github.com/Wunkolo/base2"&gt;
        base2
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A base2 implementation similar to gnu coreutil's base64
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;h1&gt;
base2 &lt;a href="https://raw.githubusercontent.com/Wunkolo/base2/master/LICENSE"&gt;&lt;img src="https://camo.githubusercontent.com/890acbdcb87868b382af9a4b1fac507b9659d9bf/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4d49542d626c75652e737667" alt="GitHub license"&gt;&lt;/a&gt;
&lt;/h1&gt;
&lt;p&gt;In the same spirit as the gnu coreutils software &lt;a href="https://www.gnu.org/software/coreutils/manual/html_node/base64-invocation.html" rel="nofollow"&gt;base64&lt;/a&gt;, &lt;code&gt;base2&lt;/code&gt; transforms data read from a file, or standard input, into (or from) base2(binary text) encoded form.&lt;/p&gt;
&lt;p&gt;Because I was bored.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;base2 - Wunkolo &amp;lt;wunkolo@gmail.com&amp;amp;gt
Usage: base2 [Options]... [File]
       base2 --decode [Options]... [File]
Options
  -h, --help            Display this help/usage information
  -d, --decode          Decode's incoming binary ascii into bytes
  -i, --ignore-garbage  When decoding, ignores non-ascii-binary `0`, `1` bytes
  -w, --wrap=Columns    Wrap encoded binary output within columns
                        Default is `76`. `0` Disables linewrapping
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Encoding:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% base2 &amp;lt;&amp;lt;&amp;lt; 'QWERTY'
01010001010101110100010101010010010101000101100100001010
% base2 --wrap=8 &amp;lt;&amp;lt;&amp;lt; 'QWERTY'
01010001                          # 'Q'
01010111                          # 'W'
01000101                          # 'E'
01010010                          # 'R'
01010100                          # 'T'
01011001                          # 'Y'
00001010                          # '\n'
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Decoding:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;% base2 -d &amp;lt;&amp;lt;&amp;lt; '01010001010101110100010101010010010101000101100100001010'
QWERTY
% base2 -d &amp;lt;&amp;lt;&amp;lt; '010100010101
011101000garbage1010blah101001001010garbage1000101100100001010'
QWFz*J�B
% base2 -d -i &amp;lt;&amp;lt;&amp;lt; '010100010
101011101000garbage1010blah101001001010garbage1000101100100001010'
QWERTY
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Did I mention its fast:&lt;/p&gt;
&lt;p&gt;&lt;a href="https://en.wikichip.org/wiki/intel/core_i3/i3-6100" rel="nofollow"&gt;i3-6100&lt;/a&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;inxi -C
CPU:       Topology: Dual Core model:&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Wunkolo/base2"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;



</description>
      <category>cpp</category>
      <category>c</category>
      <category>performance</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Fast array reversal with SIMD!</title>
      <dc:creator>Wunk</dc:creator>
      <pubDate>Sun, 05 May 2019 17:41:39 +0000</pubDate>
      <link>https://dev.to/wunk/fast-array-reversal-with-simd-j3p</link>
      <guid>https://dev.to/wunk/fast-array-reversal-with-simd-j3p</guid>
      <description>&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Serial&lt;/td&gt;
&lt;td&gt;bswap/rev&lt;/td&gt;
&lt;td&gt;SSSE3/Neon&lt;/td&gt;
&lt;td&gt;AVX2&lt;/td&gt;
&lt;td&gt;AVX512&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pattern&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasnhe2p6eykokolt9l4q.gif" alt="Serial"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1xv669t3gdp7vvvydds.gif" alt="bswap/rev"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4ctzy3i4ggo3t3aojms.gif" alt="SSSE3"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63x1lty2cqbbjmr790yr.gif" alt="AVX2"&gt;&lt;/td&gt;
&lt;td&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopujtpvsm6x3ya0i3vac.gif" alt="AVX512"&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Processor&lt;/td&gt;
&lt;td&gt;Speedup&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://en.wikichip.org/wiki/intel/core_i9/i9-7900x" rel="noopener noreferrer"&gt;i9-7900x&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;x1&lt;/td&gt;
&lt;td&gt;x15.386&lt;/td&gt;
&lt;td&gt;x10.417&lt;/td&gt;
&lt;td&gt;x22.032&lt;/td&gt;
&lt;td&gt;x22.357&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://en.wikichip.org/wiki/intel/core_i3/i3-6100" rel="noopener noreferrer"&gt;i3-6100&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;x1&lt;/td&gt;
&lt;td&gt;x15.8&lt;/td&gt;
&lt;td&gt;x10.5&lt;/td&gt;
&lt;td&gt;x16.053&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://en.wikichip.org/wiki/intel/core_i5/i5-8600k" rel="noopener noreferrer"&gt;i5-8600K&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;x1&lt;/td&gt;
&lt;td&gt;x15.905&lt;/td&gt;
&lt;td&gt;x10.21&lt;/td&gt;
&lt;td&gt;x16.076&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://en.wikichip.org/wiki/intel/core_i5/i5-8600k" rel="noopener noreferrer"&gt;E5-2697 v4&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;x1&lt;/td&gt;
&lt;td&gt;x16.701&lt;/td&gt;
&lt;td&gt;x15.716&lt;/td&gt;
&lt;td&gt;x19.141&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://en.wikipedia.org/wiki/Broadcom_Corporation#Raspberry_Pi" rel="noopener noreferrer"&gt;BCM2837&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;x1&lt;/td&gt;
&lt;td&gt;x7.391&lt;/td&gt;
&lt;td&gt;x7.718&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;Array reversal implementations typically involve swapping both ends of the array and working down to the middle-most elements. C++ being type-aware treats array elements as objects and will call overloaded class operators such as &lt;code&gt;operator=&lt;/code&gt; or a &lt;code&gt;copy by reference&lt;/code&gt; constructor where available. Many implementations of a "swap" function would use an intermediate temporary variable to make the exchange which would require a minimum of two calls to an object's &lt;code&gt;operator=&lt;/code&gt; and at least one call to an object's &lt;code&gt;copy by reference&lt;/code&gt; constructor. Some other novel algorithms use the xor-swap technique after making some assumptions about the data being swapped(integer-data, register-bound, no overrides, etc). &lt;code&gt;std::swap&lt;/code&gt; also allows an overload of &lt;code&gt;swap&lt;/code&gt; for a type to be used if it is within the same namespace as your type should you want to expose your overloaded method to C++'s standard algorithm library during the reversal&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;

&lt;span class="c1"&gt;// Taken from http://en.cppreference.com/w/cpp/algorithm/reverse&lt;/span&gt;
&lt;span class="c1"&gt;// Example of using std::reverse&lt;/span&gt;
&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;vector&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;iostream&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;iterator&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;algorithm&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;cout&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sc"&gt;'\n'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;begin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;cout&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="sc"&gt;'\n'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Note that for an odd-numbered amount of elements the middle-most element is already exactly where it needs to be and doesn't need to be moved.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasnhe2p6eykokolt9l4q.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasnhe2p6eykokolt9l4q.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Should &lt;code&gt;std::reverse&lt;/code&gt; be called upon a "&lt;strong&gt;P&lt;/strong&gt;lain &lt;strong&gt;O&lt;/strong&gt;l &lt;strong&gt;D&lt;/strong&gt;ata"(POD) type such as &lt;code&gt;std::uint8_t&lt;/code&gt;(aka &lt;code&gt;unsigned char&lt;/code&gt;) or a plain &lt;code&gt;struct&lt;/code&gt; type then compilers can safely assume that your data doesn't have any special assignment/copy overrides to worry about and can treated as raw bytes. This assumption can allow for the compiler to optimize the reversal routine into something simple and much more &lt;code&gt;memcpy&lt;/code&gt;-like.&lt;/p&gt;

&lt;p&gt;The emitted x86 of a &lt;code&gt;std::reverse&lt;/code&gt; on an array of &lt;code&gt;std::uint8_t&lt;/code&gt; generally looks something like this.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

         ; std::reverse for std::uint8_t
         0x000014a0 cmp rsi, rdi
     /=&amp;lt; 0x000014a3 je 0x14c5
     |   0x000014a5 sub rsi, 1
     |   0x000014a9 cmp rdi, rsi
    /==&amp;lt; 0x000014ac jae 0x14c5
   .---&amp;gt; 0x000014ae movzx eax, byte [rdi] ; Load two bytes, one from
   |||   0x000014b1 movzx edx, byte [rsi] ; each end.
   |||   0x000014b4 mov byte [rdi], dl    ; Write them at opposite
   |||   0x000014b6 mov byte [rsi], al    ; ends.
   |||   0x000014b8 add rdi, 1            ; Shift index at
   |||   0x000014bc sub rsi, 1            ; both ends "inward" toward the middle
   |||   0x000014c0 cmp rdi, rsi
   \===&amp;lt; 0x000014c3 jb 0x14ae
    \\-&amp;gt; 0x000014c5 ret


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;When making qReverse, the primary interface implements a templated algorithm that follows the same logic.&lt;br&gt;
The element-size at compile-time will be templated and emit a pseudo-structure that fits this exact size in an attempt to keep this illustrative implementation as generic as possible for an element of &lt;em&gt;any&lt;/em&gt; size in bytes. By having the element-size be templated it will be a lot easier to implement specializations for certain element-sizes while all other non-specialized element sizes fall-back to the serial algorithm.&lt;br&gt;
Doing this with a template allows only the proper specializations to be instanced at compile-time as opposed to comparing an element-size variable against a list of available implementations at run-time.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;

&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;ElementSize&lt;/span&gt; &lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="nf"&gt;qReverse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// An abstraction to treat the array elements as raw bytes&lt;/span&gt;
    &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="nc"&gt;ByteElement&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt; &lt;span class="n"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ElementSize&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="n"&gt;ByteElement&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ArrayN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;ByteElement&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// If compiler adds any padding/alignment bytes(and some do) then assert out&lt;/span&gt;
    &lt;span class="k"&gt;static_assert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ByteElement&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;ElementSize&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"ByteElement is pad-aligned and does not match specified element size"&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Only iterate through half of the size of the Array&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Exchange the upper and lower element as we work our&lt;/span&gt;
        &lt;span class="c1"&gt;// way down to the middle from either end&lt;/span&gt;
        &lt;span class="n"&gt;ByteElement&lt;/span&gt; &lt;span class="n"&gt;Temp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ArrayN&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
        &lt;span class="n"&gt;ArrayN&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ArrayN&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="n"&gt;ArrayN&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Temp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Emitted assemblies from gcc with certain template specializations:&lt;/p&gt;

&lt;blockquote&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;

&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;Reverse8&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qReverse&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

void qReverse&amp;lt;1ul&amp;gt;(void*, unsigned long):
  mov rcx, rsi
  shr rcx
  je .L1
  lea rdx, [rsi-1]
  lea rax, [rdi+rdx]
  sub rdx, rcx
  lea rdx, [rdi+rdx]
.L3:
  movzx ecx, BYTE PTR [rdi] ; Load bytes at each end
  movzx esi, BYTE PTR [rax]
  sub rax, 1                ; Move indexs "inwards"
  add rdi, 1
  mov BYTE PTR [rdi-1], sil ; Place them at the other end
  mov BYTE PTR [rax+1], cl
  cmp rax, rdx              ; Loop
  jne .L3
.L1:
  rep ret


&lt;/code&gt;&lt;/pre&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;

&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;Reverse16&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qReverse&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

void qReverse&amp;lt;2ul&amp;gt;(void*, unsigned long):
  mov rdx, rsi
  shr rdx
  je .L17
  lea rax, [rdi-2+rsi*2]
  add rdx, rdx
  mov rsi, rax
  sub rsi, rdx
.L12:
  movzx edx, WORD PTR [rdi] ; Same as above
  movzx ecx, WORD PTR [rax]
  sub rax, 2
  add rdi, 2
  mov WORD PTR [rdi-2], cx
  mov WORD PTR [rax+2], dx
  cmp rax, rsi
  jne .L12
.L17:
  rep ret


&lt;/code&gt;&lt;/pre&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;

&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;Reverse32&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qReverse&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

void qReverse&amp;lt;4ul&amp;gt;(void*, unsigned long):
  mov rdx, rsi
  shr rdx
  je .L25
  lea rax, [rdi-4+rsi*4]
  sal rdx, 2
  mov rsi, rax
  sub rsi, rdx
.L20:
  mov edx, DWORD PTR [rdi] ; Same as above
  mov ecx, DWORD PTR [rax]
  sub rax, 4
  add rdi, 4
  mov DWORD PTR [rdi-4], ecx
  mov DWORD PTR [rax+4], edx
  cmp rax, rsi
  jne .L20
.L25:
  rep ret


&lt;/code&gt;&lt;/pre&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;

&lt;span class="k"&gt;auto&lt;/span&gt; &lt;span class="n"&gt;Reverse24&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qReverse&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

void qReverse&amp;lt;3ul&amp;gt;(void*, unsigned long):
  mov rdx, rsi
  shr rdx
  je .L25
  lea rax, [rsi-3+rsi*2]
  lea rdx, [rdx+rdx*2]
  add rax, rdi
  mov r8, rax
  sub r8, rdx
.L20:
  movzx esi, WORD PTR [rax] ; Due to the element being 3 bytes
  movzx ecx, WORD PTR [rdi] ; The arithmetic gets a little weird
  sub rax, 3                ; But the principle is the same
  movzx edx, BYTE PTR [rdi+2]
  add rdi, 3
  mov WORD PTR [rdi-3], si
  movzx esi, BYTE PTR [rax+5]
  mov WORD PTR [rsp-3], cx
  mov BYTE PTR [rsp-1], dl
  mov BYTE PTR [rdi-1], sil
  mov WORD PTR [rax+3], cx
  mov BYTE PTR [rax+5], dl
  cmp rax, r8
  jne .L20
.L25:
  rep ret


&lt;/code&gt;&lt;/pre&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;From here it gets better!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This "plain ol data" assumption can be made for lots of different types of data. Most usages of &lt;code&gt;struct&lt;/code&gt; are intended to be treated as "bags of data" and do not have the limitation of additional memory-movement logic for copying or swapping since they are intended only to communicate a structure of interpretation of bytes. The more obvious case-study can also be having an array of &lt;code&gt;chars&lt;/code&gt; found in an ASCII &lt;code&gt;string&lt;/code&gt; or maybe a row of &lt;code&gt;uint32_t&lt;/code&gt; pixel data. If the array elements are aligned to register-sizes(which tend to be powers of 2) then these in-register byte swaps and shuffling can be especially useful. &lt;strong&gt;From this point on assume that the array of data is to be interpreted as these "bags of data" instances&lt;/strong&gt; that do not involve any kind of &lt;code&gt;operator=&lt;/code&gt; or &lt;code&gt;Foo (const Foo&amp;amp;)&lt;/code&gt; type of overhead logic so the data may be safely interpreted strictly as bytes, think &lt;code&gt;memcpy&lt;/code&gt;-like.&lt;/p&gt;

&lt;p&gt;Most of the market are running 64-bit or 32-bit machines or have register sizes that are easily much bigger than just 1 byte(the animation above had register sizes that are 4 bytes, which is the size of a single 32-bit register). An observation is that this can speed this up is by loading in a full register-sized chunk of bytes, flipping this chunk of bytes within the register, and then placing it on the other end! Swapping all the bytes in the registers is a popular operation in networking called an &lt;code&gt;endian swap&lt;/code&gt; and x86 happens to have just the instruction to do this!&lt;/p&gt;
&lt;h1&gt;
  
  
  bswap
&lt;/h1&gt;

&lt;p&gt;The &lt;code&gt;bswap&lt;/code&gt; instruction reverses the individual bytes of a register and is typically used to swap the &lt;code&gt;endian&lt;/code&gt; of an integer to exchange between &lt;code&gt;host&lt;/code&gt; and &lt;code&gt;network&lt;/code&gt; byte-order(see &lt;code&gt;htons&lt;/code&gt;,&lt;code&gt;htonl&lt;/code&gt;,&lt;code&gt;ntohs&lt;/code&gt;,&lt;code&gt;ntohl&lt;/code&gt;). Most x86 compilers implement assembly intrinsics that you can put right into your C or C++ code to get the compiler to emit the &lt;code&gt;bswap&lt;/code&gt; instruction directly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MSVC:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;_byteswap_uint64&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;_byteswap_ulong&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;_byteswap_ushort&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GCC/Clang:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;_builtin_bswap64&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;_builtin_bswap32&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;_builtin_bswap16&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The x86 header &lt;code&gt;immintrin.h&lt;/code&gt; also includes &lt;code&gt;_bswap&lt;/code&gt; and &lt;code&gt;_bswap64&lt;/code&gt;. Otherwise a more generic and portable implementation can be used as well to be more architecture-generic.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;

&lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="nf"&gt;Swap64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x00000000000000FF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;56&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x000000000000FF00&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x0000000000FF0000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x00000000FF000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;  &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x000000FF00000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;  &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x0000FF0000000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x00FF000000000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0xFF00000000000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;56&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="nf"&gt;Swap32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x000000FF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x0000FF00&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;  &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x00FF0000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;  &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0xFF000000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint16_t&lt;/span&gt; &lt;span class="nf"&gt;Swap16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint16_t&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// This tends to emit a 16-bit `rol` or `ror` instruction&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x00FF&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;  &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0xFF00&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;  &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Most compilers are able to detect when an in-register endian-swap is being done like above and will emit &lt;code&gt;bswap&lt;/code&gt; automatically or a similar intrinsic for your target architecture(The ARM architecture has the &lt;code&gt;rev&lt;/code&gt; instruction for &lt;strong&gt;armv6&lt;/strong&gt; or newer). Note also that &lt;code&gt;bswap16&lt;/code&gt; is basically just a 16-bit rotate of 1 byte which is the &lt;code&gt;rol&lt;/code&gt; or &lt;code&gt;ror&lt;/code&gt; instruction.&lt;/p&gt;

&lt;p&gt;x86_64 (gcc):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

Swap64(unsigned long):
  mov rax, rdi
  bswap rax
  ret
Swap32(unsigned int):
  mov eax, edi
  bswap eax
  ret
Swap16(unsigned short):
  mov eax, edi
  rol ax, 8
  ret


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;x86_64 (clang):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

Swap64(unsigned long): # @Swap64(unsigned long)
  bswap rdi
  mov rax, rdi
  ret

Swap32(unsigned int): # @Swap32(unsigned int)
  bswap edi
  mov eax, edi
  ret

Swap16(unsigned short): # @Swap16(unsigned short)
  rol di, 8
  mov eax, edi
  ret


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;ARM64 (gcc):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight armasm"&gt;&lt;code&gt;

&lt;span class="nl"&gt;Swap64&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;unsigned&lt;/span&gt; &lt;span class="nb"&gt;long&lt;/span&gt;&lt;span class="err"&gt;):&lt;/span&gt;
  &lt;span class="nb"&gt;rev&lt;/span&gt; &lt;span class="nv"&gt;x0&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;x0&lt;/span&gt;
  &lt;span class="nb"&gt;ret&lt;/span&gt;
&lt;span class="nl"&gt;Swap32&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;unsigned&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="err"&gt;):&lt;/span&gt;
  &lt;span class="nb"&gt;rev&lt;/span&gt; &lt;span class="nv"&gt;w0&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;w0&lt;/span&gt;
  &lt;span class="nb"&gt;ret&lt;/span&gt;
&lt;span class="nl"&gt;Swap16&lt;/span&gt;&lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="nl"&gt;unsigned&lt;/span&gt; &lt;span class="nb"&gt;short&lt;/span&gt;&lt;span class="err"&gt;):&lt;/span&gt;
  &lt;span class="nb"&gt;rev16&lt;/span&gt; &lt;span class="nv"&gt;w0&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;w0&lt;/span&gt;
  &lt;span class="nb"&gt;ret&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Using 32-bit &lt;code&gt;bswap&lt;/code&gt;s, the algorithm can take a 4-byte &lt;em&gt;chunk&lt;/em&gt; of bytes from either end into registers, &lt;code&gt;bswap&lt;/code&gt; the register, and then place the reversed &lt;em&gt;chunks&lt;/em&gt; at the opposite ends. As the algorithm gets closer to the center it can use smaller 16-bit swaps(aka a 16-bit rotate) should it encounter 2-byte chunks and eventually do serial swaps to anything left over.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3emi4ny4exr905eo8j7i.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3emi4ny4exr905eo8j7i.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and this of course can be expanded into a 64-bit &lt;code&gt;bswap&lt;/code&gt; on a 64-bit architecture allowing for even larger chunks to be reversed at once. Once again, first exhaust as many 8-byte swaps, as possible, then do the 4-byte swaps, then the two-2byte swaps, and finally fallback onto the serial 1-byte swaps if need be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1xv669t3gdp7vvvydds.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn1xv669t3gdp7vvvydds.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Given an array of &lt;code&gt;11&lt;/code&gt; bytes to be reversed(odd number, so the middle byte stays the same), divide the array size by two to get the number of &lt;em&gt;single-element&lt;/em&gt; swaps to do(using whole-integer arithmetic):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;11 / 2 = 5&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So &lt;code&gt;5&lt;/code&gt; single-element serial swaps are needed to reverse this array. Now that there is a way to do &lt;code&gt;4&lt;/code&gt; element chunks at once too, integer-divide this result &lt;code&gt;5&lt;/code&gt; again by &lt;code&gt;4&lt;/code&gt; to know how many &lt;em&gt;four-element&lt;/em&gt; swaps needed. The remainder of this division is the number of serial swaps still needed once all the four-element swaps have been exhausted:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;code&gt;5 / 4 = 1&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;5 % 4 = 1&lt;/code&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So only one 4-byte &lt;code&gt;bswap&lt;/code&gt;-swap and one &lt;code&gt;naive&lt;/code&gt;-swap is needed to fully reverse an 11-element array. Now the 1-byte qReverse template specialization can be added.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;

&lt;span class="c1"&gt;// Reverse an array of 1-byte elements(such as std::uint8_t)&lt;/span&gt;

&lt;span class="c1"&gt;// A specialization of the above implementation for 1-byte elements&lt;/span&gt;
&lt;span class="c1"&gt;// Does not call assignment or copy overloads&lt;/span&gt;
&lt;span class="c1"&gt;// Accelerated using - 64,32 and 16 bit bswap instructions&lt;/span&gt;
&lt;span class="k"&gt;template&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="kr"&gt;inline&lt;/span&gt; &lt;span class="kt"&gt;void&lt;/span&gt; &lt;span class="n"&gt;qReverse&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;void&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;Array8&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Using a new iteration variable "j" to illustrate that we know&lt;/span&gt;
    &lt;span class="c1"&gt;// the exact amount of times we have to use our chunk-swaps&lt;/span&gt;
    &lt;span class="c1"&gt;// BSWAP 64&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Get bswapped versions of our Upper and Lower 8-byte chunks&lt;/span&gt;
        &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;Lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Swap64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;Upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Swap64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Place them at their swapped position&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Upper&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Lower&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// Eight elements at a time&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// BSWAP 32&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Get bswapped versions of our Upper and Lower 4-byte chunks&lt;/span&gt;
        &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;Lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Swap32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;Upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Swap32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Place them at their swapped position&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Upper&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Lower&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// Four elements at a time&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// BSWAP 16&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Get bswapped versions of our Upper and Lower 4-byte chunks&lt;/span&gt;
        &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint16_t&lt;/span&gt; &lt;span class="n"&gt;Lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Swap16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint16_t&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint16_t&lt;/span&gt; &lt;span class="n"&gt;Upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Swap16&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint16_t&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Place them at their swapped position&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint16_t&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Upper&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint16_t&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Lower&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="c1"&gt;// Two elements at a time&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// Everything else that we can not do a bswap on, we swap normally&lt;/span&gt;
    &lt;span class="c1"&gt;// Naive swaps&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Exchange the upper and lower element as we work our&lt;/span&gt;
        &lt;span class="c1"&gt;// way down to the middle from either end&lt;/span&gt;
        &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt; &lt;span class="n"&gt;Temp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
        &lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Temp&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;On some architectures and compilers, the &lt;a href="https://www.felixcloutier.com/x86/MOVBE.html" rel="noopener noreferrer"&gt;movebe&lt;/a&gt; instruction may be emitted which is a CPU instruction that can either read OR write from a memory location AND swap the endian of it, all in one instruction. This instruction is supported since the &lt;code&gt;haswell&lt;/code&gt; architecture of intel processors and &lt;code&gt;excavator&lt;/code&gt; architecture of AMD processors and will be used automatically if you use &lt;code&gt;-march=native&lt;/code&gt; with a processor that supports it.&lt;/p&gt;

&lt;p&gt;And now some benchmarks: on a &lt;em&gt;i3-6100&lt;/em&gt; with &lt;em&gt;8gb of DDR4 ram&lt;/em&gt;. I automated the benchmark process across several different array-sizes giving each array-size &lt;code&gt;10,000&lt;/code&gt; array-reversal trials before getting an average execution time for the given array-size. Using g++ compile flags: &lt;code&gt;-m64 -Ofast -march=native&lt;/code&gt; these are the results of comparing the execution time of the current bswap &lt;code&gt;qreverse&lt;/code&gt; algorithm against &lt;code&gt;std::reverse&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element Count&lt;/th&gt;
&lt;th&gt;std::reverse&lt;/th&gt;
&lt;th&gt;qReverse&lt;/th&gt;
&lt;th&gt;Speedup Factor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;19 ns&lt;/td&gt;
&lt;td&gt;19 ns&lt;/td&gt;
&lt;td&gt;&lt;em&gt;1.000&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;20 ns&lt;/td&gt;
&lt;td&gt;19 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.053&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;24 ns&lt;/td&gt;
&lt;td&gt;23 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.043&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;36 ns&lt;/td&gt;
&lt;td&gt;23 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.565&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;54 ns&lt;/td&gt;
&lt;td&gt;21 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.571&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;90 ns&lt;/td&gt;
&lt;td&gt;22 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.091&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;159 ns&lt;/td&gt;
&lt;td&gt;26 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.115&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;298 ns&lt;/td&gt;
&lt;td&gt;35 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.514&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;43 ns&lt;/td&gt;
&lt;td&gt;21 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.048&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;td&gt;290 ns&lt;/td&gt;
&lt;td&gt;36 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.056&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10000&lt;/td&gt;
&lt;td&gt;2740 ns&lt;/td&gt;
&lt;td&gt;191 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14.346&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100000&lt;/td&gt;
&lt;td&gt;27511 ns&lt;/td&gt;
&lt;td&gt;1739 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;15.820&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000000&lt;/td&gt;
&lt;td&gt;279525 ns&lt;/td&gt;
&lt;td&gt;24710 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11.312&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;32 ns&lt;/td&gt;
&lt;td&gt;22 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.455&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;43 ns&lt;/td&gt;
&lt;td&gt;27 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.593&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;173&lt;/td&gt;
&lt;td&gt;63 ns&lt;/td&gt;
&lt;td&gt;29 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.172&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6133&lt;/td&gt;
&lt;td&gt;1680 ns&lt;/td&gt;
&lt;td&gt;127 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13.228&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10177&lt;/td&gt;
&lt;td&gt;2784 ns&lt;/td&gt;
&lt;td&gt;190 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14.653&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25253&lt;/td&gt;
&lt;td&gt;6864 ns&lt;/td&gt;
&lt;td&gt;455 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;15.086&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;31391&lt;/td&gt;
&lt;td&gt;8548 ns&lt;/td&gt;
&lt;td&gt;564 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;15.156&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50432&lt;/td&gt;
&lt;td&gt;13897 ns&lt;/td&gt;
&lt;td&gt;875 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;15.882&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;And so across the board there are speedups up to &lt;em&gt;&lt;strong&gt;x15.8!&lt;/strong&gt;&lt;/em&gt; before dipping down a bit for the more larger array sizes potentially due to the accumulation of cache misses with such large amounts of data. The algorithm reaches out to either end of a potentially massive array which lends itself to an accumulation of cache misses at some point. Still a &lt;em&gt;very&lt;/em&gt; large and significant speedup over &lt;code&gt;std::reverse&lt;/code&gt; consistantly without trying to do some &lt;code&gt;_mm_prefetch&lt;/code&gt; arithmetic to get the cache to behave.&lt;/p&gt;
&lt;h1&gt;
  
  
  SIMD
&lt;/h1&gt;

&lt;p&gt;The &lt;code&gt;bswap&lt;/code&gt; instruction can reverse the byte-order of &lt;code&gt;2&lt;/code&gt;, &lt;code&gt;4&lt;/code&gt;, or &lt;code&gt;8&lt;/code&gt; bytes, but several x86 extensions later and now it is possible to swap the byte order of &lt;code&gt;16&lt;/code&gt;, &lt;code&gt;32&lt;/code&gt;, even &lt;code&gt;64&lt;/code&gt; bytes all at once through the use of &lt;code&gt;SIMD&lt;/code&gt;. &lt;code&gt;SIMD&lt;/code&gt; stands for &lt;em&gt;Single Instruction Multiple Data&lt;/em&gt; and allows operation upon multiple lanes of data in parallel using only a single instruction. Much like &lt;code&gt;bswap&lt;/code&gt; which atomically reverses all four bytes in a register, &lt;code&gt;SIMD&lt;/code&gt; provides an entire instruction set of arithmetic that allows manipulation of multiple instances of data at once in parallel using a single instruction. These chunks of data that are operated upon tend to be called &lt;code&gt;vectors&lt;/code&gt; of data. Multiple bytes of data can then be elevated into a &lt;code&gt;vector&lt;/code&gt; register to reverse its order and place it on the opposite end similarly to the &lt;code&gt;bswap&lt;/code&gt; implementation but with even larger chunks.&lt;/p&gt;

&lt;p&gt;These additions to the algorithm will span higher-width chunks of bytes and will be append above the chain of &lt;code&gt;bswap&lt;/code&gt; accelerated swap-loops to esure that the largest swaps are exhausted first before the smaller ones. Over the years the x86 architecture has seen many generations of &lt;code&gt;SIMD&lt;/code&gt; implementations, improvements, and instruciton sets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;MMX&lt;/code&gt; (1996)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SSE&lt;/code&gt; (1999)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SSE2&lt;/code&gt; (2001)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SSE3&lt;/code&gt; (2004)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SSSE3&lt;/code&gt; (2006)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SSE4 a/1/2&lt;/code&gt; (2006)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AVX&lt;/code&gt; (2008)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AVX2&lt;/code&gt; (2013)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;AVX512&lt;/code&gt; (2015)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some are kept around for compatibilities sake(&lt;code&gt;MMX&lt;/code&gt;) and some are so recent, elusive, or so &lt;em&gt;vendor-specific&lt;/em&gt; to Intel or AMD that you're probably not likely to have a processor that features it(&lt;code&gt;SSE4a&lt;/code&gt;). Some are very specific to enterprise hardware (such as &lt;code&gt;AVX512&lt;/code&gt;) and are not likely to be on consumer hardware either. Other architectures may also have their own implementation of SIMD such as ARM's simd co-processor &lt;code&gt;NEON&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;At the moment the steam hardware survey states that &lt;strong&gt;94.42%&lt;/strong&gt; of all CPUs on Steam feature &lt;code&gt;SSSE3&lt;/code&gt;(&lt;a href="http://store.steampowered.com/hwsurvey/" rel="noopener noreferrer"&gt;store.steampowered.com/hwsurvey/&lt;/a&gt;). &lt;code&gt;SSSE3&lt;/code&gt; is what will be used as first step into higher &lt;code&gt;SIMD&lt;/code&gt; territory. &lt;code&gt;SSSE3&lt;/code&gt; in particular due to its &lt;code&gt;_mm_shuffle_epi8&lt;/code&gt; instruction which lets allows the processor to &lt;em&gt;shuffle&lt;/em&gt; bytes within our 128-bit register with relative ease for illustrating this implementation.&lt;/p&gt;
&lt;h1&gt;
  
  
  SSSE3
&lt;/h1&gt;

&lt;p&gt;&lt;code&gt;SSE&lt;/code&gt; stands for "Streaming SIMD Extensions" while &lt;code&gt;SSSE3&lt;/code&gt; stands for "Supplemental Streaming SIMD Extensions 3" which is the &lt;em&gt;fourth&lt;/em&gt; iteration of &lt;code&gt;SSE&lt;/code&gt; technology. SSE introduces registers that allow for some &lt;code&gt;128&lt;/code&gt; bit vector arithmetic. In C or C++ code the intent to use these registers is represented using types such as &lt;code&gt;__m128i&lt;/code&gt; or &lt;code&gt;__m128d&lt;/code&gt; which tell the compiler that any notion of &lt;em&gt;storage&lt;/em&gt; for these types should find their place within the 128-bit &lt;code&gt;SSE&lt;/code&gt; registers when ever possible. Intrinsics such as &lt;code&gt;_mm_add_epi8&lt;/code&gt; which will add two &lt;code&gt;__m128i&lt;/code&gt;s together, and treat them as a &lt;em&gt;vector&lt;/em&gt; of 8-bit elements are now available within C and C++ code. The &lt;code&gt;i&lt;/code&gt; and &lt;code&gt;d&lt;/code&gt; found in &lt;code&gt;__m128i&lt;/code&gt; and &lt;code&gt;__m128d&lt;/code&gt; are to notify intent of the 128-register's interpretation as &lt;code&gt;i&lt;/code&gt;nteger and &lt;code&gt;d&lt;/code&gt;ouble respectively. &lt;code&gt;__m128&lt;/code&gt; is assumed to be a vector of four &lt;code&gt;floats&lt;/code&gt; by default. Since integer-data is what is being operated upon, the &lt;code&gt;__m128i&lt;/code&gt; data type will be used as our data representation which gives access to the &lt;code&gt;_mm_shuffle_epi8&lt;/code&gt; instruction. Note that &lt;code&gt;SSSE3&lt;/code&gt; requires the gcc compile flag &lt;code&gt;-mssse3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Now to draft a &lt;code&gt;SSSE3&lt;/code&gt; byte swapping implementation and create a simulated 16-byte &lt;code&gt;bswap&lt;/code&gt; using &lt;code&gt;SSSE3&lt;/code&gt;. First, add &lt;code&gt;#include &amp;lt;tmmintrin.h&amp;gt;&lt;/code&gt; in C or C++ code to expose every intrinsic from &lt;code&gt;MMX&lt;/code&gt; up until &lt;code&gt;SSSE3&lt;/code&gt; to your current source file. Then, use the instrinsic &lt;code&gt;_mm_loadu_si128&lt;/code&gt; to &lt;code&gt;load&lt;/code&gt; an &lt;code&gt;u&lt;/code&gt;naligned &lt;code&gt;s&lt;/code&gt;igned &lt;code&gt;i&lt;/code&gt;nteger vector of &lt;code&gt;128&lt;/code&gt; bits into a &lt;code&gt;__m128i&lt;/code&gt; variable. At a hardware level, &lt;em&gt;unaligned&lt;/em&gt; data and &lt;em&gt;aligned&lt;/em&gt; data interfaces with the memory hardware slightly differently and can provide for some further slight speedups should data-alignment be guarenteed. No assumption can be made about the alignment of the pointers being passed to qReverse so unaligned memory access will be used. When done with the vector-arithmetic, call an equivalent &lt;code&gt;_mm_storeu_si128&lt;/code&gt; which stores the vector data into an unaligned memory address. This &lt;code&gt;SSSE3&lt;/code&gt; implementation will go right above the previous &lt;code&gt;Swap64&lt;/code&gt; implementation, ensuring that our algorithm exhausts as much of the larger chunks as possible before resorting to the smaller ones:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;

&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;tmmintrin.h&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__m128i&lt;/span&gt; &lt;span class="n"&gt;ShuffleRev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_set_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Load 16 elements at once into one 16-byte register&lt;/span&gt;
    &lt;span class="n"&gt;__m128i&lt;/span&gt; &lt;span class="n"&gt;Lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_loadu_si128&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__m128i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;__m128i&lt;/span&gt; &lt;span class="n"&gt;Upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_loadu_si128&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__m128i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Reverse the byte order of our 16-byte vectors&lt;/span&gt;
    &lt;span class="n"&gt;Lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_shuffle_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Lower&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ShuffleRev&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_shuffle_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Upper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ShuffleRev&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Place them at their swapped position&lt;/span&gt;
    &lt;span class="n"&gt;_mm_storeu_si128&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__m128i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="n"&gt;Upper&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;_mm_storeu_si128&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__m128i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="n"&gt;Lower&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// 16 elements at a time&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// Right above the Swap64 implementation...&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;This basically implements a beefed-up 16-byte &lt;code&gt;bswap&lt;/code&gt; using &lt;code&gt;SSSE3&lt;/code&gt;. The heart of it all is the &lt;code&gt;_mm_shuffle_epi8&lt;/code&gt; instruction which &lt;em&gt;shuffles&lt;/em&gt; the vector in the first argument according to the vector of byte-indices found in the second argument and returns this new &lt;em&gt;shuffled&lt;/em&gt; vector. A constant vector &lt;code&gt;ShuffleRev&lt;/code&gt; is declared using &lt;code&gt;_mm_set_epi8&lt;/code&gt; with each byte set to the index of where it should get its byte from(starting from least significant byte). You might read it as going from 0 to 15 in ascending order but this is actually indexing the bytes in reverse order which gives a fully reversed 16-bit sub-array of bytes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4ctzy3i4ggo3t3aojms.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr4ctzy3i4ggo3t3aojms.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now for some speed tests.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element Count&lt;/th&gt;
&lt;th&gt;std::reverse&lt;/th&gt;
&lt;th&gt;qReverse&lt;/th&gt;
&lt;th&gt;Speedup Factor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;19 ns&lt;/td&gt;
&lt;td&gt;19 ns&lt;/td&gt;
&lt;td&gt;&lt;em&gt;1.000&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;22 ns&lt;/td&gt;
&lt;td&gt;20 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.100&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;27 ns&lt;/td&gt;
&lt;td&gt;20 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.350&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;37 ns&lt;/td&gt;
&lt;td&gt;21 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.762&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;55 ns&lt;/td&gt;
&lt;td&gt;23 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.391&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;91 ns&lt;/td&gt;
&lt;td&gt;24 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.792&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;158 ns&lt;/td&gt;
&lt;td&gt;31 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.097&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;297 ns&lt;/td&gt;
&lt;td&gt;42 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.071&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;43 ns&lt;/td&gt;
&lt;td&gt;21 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.048&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;td&gt;291 ns&lt;/td&gt;
&lt;td&gt;42 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;6.929&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10000&lt;/td&gt;
&lt;td&gt;2743 ns&lt;/td&gt;
&lt;td&gt;261 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;10.510&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100000&lt;/td&gt;
&lt;td&gt;27523 ns&lt;/td&gt;
&lt;td&gt;2966 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.280&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000000&lt;/td&gt;
&lt;td&gt;279812 ns&lt;/td&gt;
&lt;td&gt;33832 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.271&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;32 ns&lt;/td&gt;
&lt;td&gt;21 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.524&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;43 ns&lt;/td&gt;
&lt;td&gt;24 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.792&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;173&lt;/td&gt;
&lt;td&gt;62 ns&lt;/td&gt;
&lt;td&gt;29 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.138&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6133&lt;/td&gt;
&lt;td&gt;1683 ns&lt;/td&gt;
&lt;td&gt;185 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.097&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10177&lt;/td&gt;
&lt;td&gt;2787 ns&lt;/td&gt;
&lt;td&gt;291 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.577&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25253&lt;/td&gt;
&lt;td&gt;6862 ns&lt;/td&gt;
&lt;td&gt;712 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.638&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;31391&lt;/td&gt;
&lt;td&gt;8546 ns&lt;/td&gt;
&lt;td&gt;916 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.330&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50432&lt;/td&gt;
&lt;td&gt;13893 ns&lt;/td&gt;
&lt;td&gt;1497 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.281&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Speedups of up to &lt;em&gt;&lt;strong&gt;x10.5&lt;/strong&gt;&lt;/em&gt;!... but this is lower than the &lt;code&gt;bswap&lt;/code&gt; version which reached up to &lt;em&gt;&lt;strong&gt;x15.8&lt;/strong&gt;&lt;/em&gt;? Maybe some loop unrolling or some prefetching might help this algorithm play nice with the cache. In this implementation only two out of the available 8 registers are being used as well so there is some great room for improvement(Todo)&lt;/p&gt;
&lt;h1&gt;
  
  
  AVX2
&lt;/h1&gt;

&lt;p&gt;The implementation can go even further to work with the even larger 256-bit registers that the &lt;code&gt;AVX/AVX2&lt;/code&gt; extension provides and reverse &lt;em&gt;32 byte chunks&lt;/em&gt; at a time. The implementation is very similar to the &lt;code&gt;SSSE3&lt;/code&gt; one: load in &lt;em&gt;unaligned&lt;/em&gt; data into a 256-bit register using the &lt;code&gt;__m256i&lt;/code&gt; type. The issue with &lt;code&gt;AVX/AVX2&lt;/code&gt; is that the &lt;code&gt;256-bit&lt;/code&gt; register is actually two individual &lt;code&gt;128-bit&lt;/code&gt; &lt;em&gt;lanes&lt;/em&gt; being operated in parallel as one larger &lt;code&gt;256-bit&lt;/code&gt; register and overlaps in functionality with the &lt;code&gt;SSE&lt;/code&gt; register almost as an additional layer of abstraction added upon &lt;code&gt;SSE&lt;/code&gt;. Now here's where things get tricky, there is no &lt;code&gt;_mm256_shuffle_epi8&lt;/code&gt; instruction that works like you'd think it would. Since it's just operating on two 128-bit lanes in parallel, &lt;code&gt;AVX/AVX2&lt;/code&gt; instructions introduces a limitation in which some cross-lane arithmetic requires special cross-lane attention. Some instructions will accept 256-bit &lt;code&gt;AVX&lt;/code&gt; registers but only actually operates upon 128-bit lanes. The trick here is that rather than trying to reverse a 256-bit register atomically in one go, instead reverse the bytes within the two 128-bit lanes, as if shuffling two 128-bit registers like in the &lt;code&gt;SSSE3&lt;/code&gt; implementation, and then reverse the two 128-bit lanes themselves with whatever cross-lane arithmetic that &lt;em&gt;is&lt;/em&gt; available in &lt;code&gt;AVX/AVX2&lt;/code&gt;. Note that &lt;code&gt;AVX2&lt;/code&gt; requires the gcc compile flag &lt;code&gt;-mavx2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX,AVX2&amp;amp;text=shuffle&amp;amp;expand=4726" rel="noopener noreferrer"&gt;_mm256_shuffle_epi8&lt;/a&gt; is an &lt;code&gt;AVX2&lt;/code&gt; instruction that shuffles the two 128-bit lanes of the 256-bit register much like the &lt;code&gt;SSSE3&lt;/code&gt; intrinsic so this can be taken care of first.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;

&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__m256i&lt;/span&gt; &lt;span class="n"&gt;ShuffleRev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_set_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// first 128-bit lane&lt;/span&gt;
    &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;  &lt;span class="c1"&gt;// second 128-bit lane&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Load 32 elements at once into one 32-byte register&lt;/span&gt;
&lt;span class="n"&gt;__m256i&lt;/span&gt; &lt;span class="n"&gt;Lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_loadu_si256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__m256i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;__m256i&lt;/span&gt; &lt;span class="n"&gt;Upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_loadu_si256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__m256i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Reverse each the bytes in each 128-bit lane&lt;/span&gt;
&lt;span class="n"&gt;Lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_shuffle_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Lower&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ShuffleRev&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;Upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_shuffle_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Upper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ShuffleRev&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;So in the large 256-bit register, the two 128-bit lanes are now reversed, but now the 128-bit lanes themselves must be reversed. &lt;a href="https://software.intel.com/en-us/node/524015" rel="noopener noreferrer"&gt;_mm256_permute2x128_si256&lt;/a&gt; is another &lt;code&gt;AVX2&lt;/code&gt; instruction that permutes the 128-bit lanes of two 256-bit registers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F523aiguilgvhqy5axndq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F523aiguilgvhqy5axndq.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Given two big 256-bit vectors and an 8-byte immediate value it can select how the new 256-bit vector it builds is going to be assembled. Pass in the same variable for both of the arguments and "picking" from them as if they were just 2-element arrays of 16-byte elements can simulate a big 128-bit cross-lane swap(think of it like &lt;code&gt;__m128i SomeAVXRegister[2]&lt;/code&gt; and going &lt;code&gt;SomeAVXRegister[0]&lt;/code&gt; or &lt;code&gt;SomeAVXRegister[1]&lt;/code&gt; ). In a way, this is also a big 128-bit "rotate" if you can visualize it.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;

&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__m256i&lt;/span&gt; &lt;span class="n"&gt;ShuffleRev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_set_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Load 32 elements at once into one 32-byte register&lt;/span&gt;
    &lt;span class="n"&gt;__m256i&lt;/span&gt; &lt;span class="n"&gt;Lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_loadu_si256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__m256i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;__m256i&lt;/span&gt; &lt;span class="n"&gt;Upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_loadu_si256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__m256i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Reverse the bytes inside each of the two 16-byte lanes&lt;/span&gt;
    &lt;span class="n"&gt;Lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_shuffle_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Lower&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ShuffleRev&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_shuffle_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Upper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ShuffleRev&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Reverse the order of the 16-byte lanes&lt;/span&gt;
    &lt;span class="n"&gt;Lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_permute2x128_si256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Lower&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Lower&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;Upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_permute2x128_si256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Upper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Upper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Place them at their swapped position&lt;/span&gt;
    &lt;span class="n"&gt;_mm256_storeu_si256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__m256i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="n"&gt;Upper&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;_mm256_storeu_si256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__m256i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
        &lt;span class="n"&gt;Lower&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// 32 elements at a time&lt;/span&gt;
    &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// Right above the SSSE3 implementation&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63x1lty2cqbbjmr790yr.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63x1lty2cqbbjmr790yr.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Benchmarks:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element Count&lt;/th&gt;
&lt;th&gt;std::reverse&lt;/th&gt;
&lt;th&gt;qReverse&lt;/th&gt;
&lt;th&gt;Speedup Factor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;19 ns&lt;/td&gt;
&lt;td&gt;19 ns&lt;/td&gt;
&lt;td&gt;&lt;em&gt;1.000&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;21 ns&lt;/td&gt;
&lt;td&gt;21 ns&lt;/td&gt;
&lt;td&gt;&lt;em&gt;1.000&lt;/em&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;26 ns&lt;/td&gt;
&lt;td&gt;20 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.300&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;37 ns&lt;/td&gt;
&lt;td&gt;22 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.682&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;54 ns&lt;/td&gt;
&lt;td&gt;20 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.700&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;91 ns&lt;/td&gt;
&lt;td&gt;24 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.792&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;159 ns&lt;/td&gt;
&lt;td&gt;27 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5.889&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;298 ns&lt;/td&gt;
&lt;td&gt;36 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.278&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;45 ns&lt;/td&gt;
&lt;td&gt;20 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.250&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;td&gt;292 ns&lt;/td&gt;
&lt;td&gt;36 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.111&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10000&lt;/td&gt;
&lt;td&gt;2739 ns&lt;/td&gt;
&lt;td&gt;189 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14.492&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100000&lt;/td&gt;
&lt;td&gt;27515 ns&lt;/td&gt;
&lt;td&gt;1714 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16.053&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000000&lt;/td&gt;
&lt;td&gt;279701 ns&lt;/td&gt;
&lt;td&gt;25417 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11.004&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;32 ns&lt;/td&gt;
&lt;td&gt;21 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.524&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;44 ns&lt;/td&gt;
&lt;td&gt;25 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.760&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;173&lt;/td&gt;
&lt;td&gt;63 ns&lt;/td&gt;
&lt;td&gt;29 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.172&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6133&lt;/td&gt;
&lt;td&gt;1681 ns&lt;/td&gt;
&lt;td&gt;127 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13.236&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10177&lt;/td&gt;
&lt;td&gt;2782 ns&lt;/td&gt;
&lt;td&gt;192 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;14.490&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25253&lt;/td&gt;
&lt;td&gt;6863 ns&lt;/td&gt;
&lt;td&gt;449 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;15.285&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;31391&lt;/td&gt;
&lt;td&gt;8545 ns&lt;/td&gt;
&lt;td&gt;556 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;15.369&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50432&lt;/td&gt;
&lt;td&gt;13888 ns&lt;/td&gt;
&lt;td&gt;875 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;15.872&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A speedup of up to &lt;em&gt;&lt;strong&gt;x16.053&lt;/strong&gt;&lt;/em&gt;!&lt;/p&gt;
&lt;h1&gt;
  
  
  AVX512
&lt;/h1&gt;

&lt;p&gt;&lt;code&gt;AVX512&lt;/code&gt; is particularly rare out in the commercial world. Even so, the algorithm can take that much more of a step forward and operate upon massive 512-bit bit registers. This will allow a swap of &lt;code&gt;64&lt;/code&gt; bytes of data at once. At the moment, C and C++ compiler implementations of the &lt;code&gt;AVX512&lt;/code&gt; instruction set are spotty at best. There is the benefit of the &lt;a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=4729,4753&amp;amp;text=_mm512_shuffle_epi8" rel="noopener noreferrer"&gt;_mm512_shuffle_epi8&lt;/a&gt; instruction that will allow the shuffling of the four 128-lane registers within the 512-bit register with 8-bit indexes though there is not a confident implementation of &lt;a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#=undefined&amp;amp;avx512techs=AVX512F&amp;amp;expand=4726,4038,4594,4594,4618,4618&amp;amp;text=_mm512_set_epi8" rel="noopener noreferrer"&gt;_mm512_set_epi8&lt;/a&gt; to be found in MSVC or GCC. There is &lt;a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#=undefined&amp;amp;expand=4726,4038,4594,4594,4618,4618,4618,4611&amp;amp;text=_mm512_set_epi32&amp;amp;avx512techs=AVX512F" rel="noopener noreferrer"&gt;_mm512_set_epi32&lt;/a&gt; which will require generation of the &lt;code&gt;ShuffleRev&lt;/code&gt; constant to use 32-bit integers rather than 8-bit integers. After the initial &lt;code&gt;_mm512_shuffle_epi8&lt;/code&gt; the four lanes still must be reversed due to the need for cross-lane arithmetic so an additional &lt;a href="https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=4726,4594,4729,3972,4029,4029,3930,4729,3930,4029,4589,4029,4029,4047,4047&amp;amp;avx512techs=AVX512F&amp;amp;text=_mm512_permutexvar_epi64" rel="noopener noreferrer"&gt;_mm512_permutexvar_epi64&lt;/a&gt; is needed to truely complete the reversal. This is similar to what had to be done for the &lt;code&gt;AVX2&lt;/code&gt; implementation above.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;AVX512&lt;/code&gt; is not a single set of instructions. &lt;code&gt;AVX512&lt;/code&gt; has different subsets which may or may not be implemented for a particular processor. For example there is &lt;code&gt;AVX512CD&lt;/code&gt; for conflict detection and &lt;code&gt;AVX512ER&lt;/code&gt; for exponential and reciprocal instructions though ALL implementations of &lt;code&gt;AVX512&lt;/code&gt; &lt;em&gt;require&lt;/em&gt; that &lt;code&gt;AVX512F&lt;/code&gt;(AVX-512 Foundation) be implemented. &lt;code&gt;_mm512_shuffle_epi8&lt;/code&gt; is an instruction implemented by the &lt;code&gt;AVX512BW&lt;/code&gt; subset which adds &lt;strong&gt;B&lt;/strong&gt;yte and &lt;strong&gt;W&lt;/strong&gt;ord operations while &lt;code&gt;_mm512_setepi32&lt;/code&gt; and &lt;code&gt;_mm512_permutexvar_epi64&lt;/code&gt; are &lt;code&gt;AVX512F&lt;/code&gt; so &lt;code&gt;_mm512_shuffle_epi8&lt;/code&gt; is the only "stretch" requirement involved here. &lt;code&gt;AVX512BW&lt;/code&gt; is currently supported by the current Skylake-X processors and is planned to be implemented more commonly in future processors. For reference, a current map of &lt;code&gt;AVX512&lt;/code&gt; subset implementations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyks3t92h7dfyylbag5o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyks3t92h7dfyylbag5o.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In GCC, specific subsets of &lt;code&gt;AVX512&lt;/code&gt; must be enabled using compile flags:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Subset&lt;/th&gt;
&lt;th&gt;Flag&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Foundation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-mavx512f&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefetch&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-mavx512pf&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exponential/Reciprocal&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-mavx512er&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conflict Detection&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-mavx512cd&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector Length&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-mavx512vl&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Byte And Word&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-mavx512bw&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Doubleword and Quadword&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-mavx512dq&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Integer Fused Multiply Add&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-mavx512ifma&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Vector Byte Manipulation&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-mavx512vbmi&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;_mm512_shuffle_epi8&lt;/code&gt; and requires the &lt;code&gt;-mavx512bw&lt;/code&gt; flag to compile in gcc while the rest only requires &lt;code&gt;-mavx512f&lt;/code&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;

&lt;span class="c1"&gt;// Could have done:&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__m512i&lt;/span&gt; &lt;span class="n"&gt;ShuffleRev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_set_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// but instead have to do the more awkward:&lt;/span&gt;
&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__m512i&lt;/span&gt; &lt;span class="n"&gt;ShuffleRev&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_set_epi32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="mh"&gt;0x00010203&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x4050607&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x8090a0b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0xc0d0e0f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mh"&gt;0x00010203&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x4050607&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x8090a0b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0xc0d0e0f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mh"&gt;0x00010203&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x4050607&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x8090a0b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0xc0d0e0f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="mh"&gt;0x00010203&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x4050607&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x8090a0b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0xc0d0e0f&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;The full &lt;code&gt;AVX512&lt;/code&gt; implementation:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;

&lt;span class="p"&gt;...&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Reverses the 16 bytes of the four  128-bit lanes in a 512-bit register&lt;/span&gt;
        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__m512i&lt;/span&gt; &lt;span class="n"&gt;ShuffleRev8&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_set_epi32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="mh"&gt;0x00010203&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x4050607&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x8090a0b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0xc0d0e0f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x00010203&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x4050607&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x8090a0b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0xc0d0e0f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x00010203&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x4050607&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x8090a0b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0xc0d0e0f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="mh"&gt;0x00010203&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x4050607&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0x8090a0b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mh"&gt;0xc0d0e0f&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Reverses the four 128-bit lanes of a 512-bit register&lt;/span&gt;
        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__m512i&lt;/span&gt; &lt;span class="n"&gt;ShuffleRev64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_set_epi64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;6&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Load 64 elements at once into one 64-byte register&lt;/span&gt;
        &lt;span class="n"&gt;__m512i&lt;/span&gt; &lt;span class="n"&gt;Lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_loadu_si512&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__m512i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;__m512i&lt;/span&gt; &lt;span class="n"&gt;Upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_loadu_si512&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__m512i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Reverse the byte order of each 128-bit lane&lt;/span&gt;
        &lt;span class="n"&gt;Lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_shuffle_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Lower&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ShuffleRev8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;Upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_shuffle_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Upper&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;ShuffleRev8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Reverse the four 128-bit lanes in the 512-bit register&lt;/span&gt;
        &lt;span class="n"&gt;Lower&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_permutexvar_epi64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ShuffleRev64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Lower&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;Upper&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm512_permutexvar_epi64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ShuffleRev64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Upper&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Place them at their swapped position&lt;/span&gt;
        &lt;span class="n"&gt;_mm512_storeu_si512&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__m512i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="n"&gt;Upper&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;_mm512_storeu_si512&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="k"&gt;reinterpret_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;__m512i&lt;/span&gt;&lt;span class="o"&gt;*&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Array8&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Count&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
            &lt;span class="n"&gt;Lower&lt;/span&gt;
        &lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// 64 elements at a time&lt;/span&gt;
        &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="c1"&gt;// Above the AVX2 implementation&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopujtpvsm6x3ya0i3vac.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fopujtpvsm6x3ya0i3vac.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since &lt;code&gt;AVX512&lt;/code&gt; is pretty rare on consumer hardware at the moment: &lt;a href="https://software.intel.com/en-us/articles/intel-software-development-emulator" rel="noopener noreferrer"&gt;Intel provides an emulator&lt;/a&gt; that can provide for some verification that the algorithm properly reverses the array. The emulator is no grounds for a proper hardware benchmark though. After creating a simple test program to verify that the array has been reversed it can be ran through the emulator and verified:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sde64 -mix -no_shared_libs  -- ./Verify1 128&lt;/code&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

Original:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
Reversed:
127 126 125 124 123 122 121 120 119 118 117 116 115 114 113 112 111 110 109 108 107 106 105 104 103 102 101 100 99 98 97 96 95 94 93 92 91 90 89 88 87 86 85 84 83 82 81 80 79 78 77 76 75 74 73 72 71 70 69 68 67 66 65 64 63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
[PASS] Array Reversed


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;The &lt;code&gt;-mix&lt;/code&gt; will cause the software development emulator to audit the execution of the program and the instructions it encounters into a &lt;code&gt;sde-mix-out.txt&lt;/code&gt; file. This file is massive by default so &lt;code&gt;-no_shared_libs&lt;/code&gt; removes the auditing of shared libraries(such as the standard libraries) from the report. With this the execution summery of &lt;code&gt;qreverse&amp;lt;1&amp;gt;&lt;/code&gt; can be examined:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# $dynamic-counts-for-function: void qReverse&amp;lt;1ul&amp;gt;(void*, unsigned long)  IMG: /qreverse/build/Verify1 at [0x4e1ffd40a0, 0x4e1ffd4e3e)   1.442%
#
# TID 0
#       opcode                 count
#
...
*isa-set-AVX512BW_512                                                  4
*isa-set-AVX512F_512                                                   6
...
*category-AVX512                                                       4
...
*avx512                                                               10
...
...
VMOVDQA64                                                              2  &amp;lt; Note these are called only twice which
VMOVDQU64                                                              2    makes sense given a 128-byte array
VMOVDQU8                                                               2    and only needing one AVX-512 swap
VPERMQ                                                                 2
VPSHUFB                                                                2
*total                                                                75



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Not only are the &lt;code&gt;AVX512BW&lt;/code&gt; instructions verified to have ran and worked but the entire reversal of &lt;code&gt;128&lt;/code&gt; elements took only &lt;code&gt;75&lt;/code&gt; instructions in total!&lt;/p&gt;

&lt;p&gt;I recently acquired an Intel i9-7900x &lt;code&gt;BX80673I97900X&lt;/code&gt; which features a large portion of the &lt;code&gt;AVX512&lt;/code&gt; subsets and is capable of providing some actual hardware benchmarks for this implementation.&lt;br&gt;
Here the benchmark is compiled using Visual Studio 2017 for x64-Release mode. &lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Element Count&lt;/th&gt;
&lt;th&gt;std::reverse&lt;/th&gt;
&lt;th&gt;qReverse&lt;/th&gt;
&lt;th&gt;Speedup Factor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;63 ns&lt;/td&gt;
&lt;td&gt;59 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.068&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;62 ns&lt;/td&gt;
&lt;td&gt;61 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.016&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;74 ns&lt;/td&gt;
&lt;td&gt;59 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.254&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;104 ns&lt;/td&gt;
&lt;td&gt;61 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.705&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;128&lt;/td&gt;
&lt;td&gt;80 ns&lt;/td&gt;
&lt;td&gt;18 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.444&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;90 ns&lt;/td&gt;
&lt;td&gt;20 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;4.500&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;512&lt;/td&gt;
&lt;td&gt;157 ns&lt;/td&gt;
&lt;td&gt;22 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7.136&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;276 ns&lt;/td&gt;
&lt;td&gt;28 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.857&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;44 ns&lt;/td&gt;
&lt;td&gt;20 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.200&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000&lt;/td&gt;
&lt;td&gt;269 ns&lt;/td&gt;
&lt;td&gt;30 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;8.967&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10000&lt;/td&gt;
&lt;td&gt;2504 ns&lt;/td&gt;
&lt;td&gt;112 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;22.357&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100000&lt;/td&gt;
&lt;td&gt;24222 ns&lt;/td&gt;
&lt;td&gt;1368 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;17.706&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1000000&lt;/td&gt;
&lt;td&gt;236354 ns&lt;/td&gt;
&lt;td&gt;21192 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11.153&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;33 ns&lt;/td&gt;
&lt;td&gt;21 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.571&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;40 ns&lt;/td&gt;
&lt;td&gt;23 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1.739&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;173&lt;/td&gt;
&lt;td&gt;58 ns&lt;/td&gt;
&lt;td&gt;29 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.000&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6133&lt;/td&gt;
&lt;td&gt;1438 ns&lt;/td&gt;
&lt;td&gt;93 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;15.462&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10177&lt;/td&gt;
&lt;td&gt;2481 ns&lt;/td&gt;
&lt;td&gt;147 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;16.878&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25253&lt;/td&gt;
&lt;td&gt;5794 ns&lt;/td&gt;
&lt;td&gt;332 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;17.452&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;31391&lt;/td&gt;
&lt;td&gt;7397 ns&lt;/td&gt;
&lt;td&gt;420 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;17.612&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50432&lt;/td&gt;
&lt;td&gt;11915 ns&lt;/td&gt;
&lt;td&gt;858 ns&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13.887&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A plateau of speedups up to &lt;em&gt;&lt;strong&gt;x22.357&lt;/strong&gt;&lt;/em&gt;!&lt;/p&gt;
&lt;h1&gt;
  
  
  Misc
&lt;/h1&gt;

&lt;p&gt;Once we work our way down the middle and end up with something like &lt;code&gt;4&lt;/code&gt; "middle" elements left then we are just one &lt;code&gt;Swap32&lt;/code&gt; left from having the entire array reversed. What if we worked our way down to the middle and ended up with &lt;code&gt;5&lt;/code&gt; elements though? This would not be possible actually so long as we have &lt;code&gt;Swap16&lt;/code&gt;. &lt;code&gt;5&lt;/code&gt; middle elements would mean we have &lt;code&gt;one middle element&lt;/code&gt; with &lt;code&gt;two elements on either side&lt;/code&gt;. Our &lt;code&gt;for( std::size_t j = i/2; j &amp;lt; ( (Count/2) / 2)&lt;/code&gt; would catch that and &lt;code&gt;Swap16&lt;/code&gt; the two elements on either side, getting us just &lt;code&gt;1&lt;/code&gt; element left right in the middle which can stay right where it is within a reversed array(since the middle-most element in an odd-numbered array is our &lt;em&gt;pivot&lt;/em&gt; and doesn't have to move anywhere).&lt;/p&gt;

&lt;p&gt;Later we can find a way to accelerate our algorithm to have it consider these pivot-cases efficiently so that rather than calling two &lt;code&gt;Swap16&lt;/code&gt;s on either half of a 4-byte case it could just call one last &lt;code&gt;Swap32&lt;/code&gt; or even a bigger before it even parks itself in that situation of having to use the naive swap. Something like this could remove the use of the naive swap pretty much entirely.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Wunkolo" rel="noopener noreferrer"&gt;
        Wunkolo
      &lt;/a&gt; / &lt;a href="https://github.com/Wunkolo/qreverse" rel="noopener noreferrer"&gt;
        qreverse
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A small study in hardware accelerated AoS reversal
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;qReverse &lt;a href="https://raw.githubusercontent.com/Wunkolo/qreverse/master/LICENSE" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/6581c31c16c1b13ddc2efb92e2ad69a93ddc4a92fd871ff15d401c4c6c9155a4/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4d49542d626c75652e737667" alt="GitHub license"&gt;&lt;/a&gt;
&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;qReverse is an architecture-accelerated array reversal algorithm intended as a personal study to design a fast AoS reversal algorithm utilizing SIMD.&lt;/p&gt;

&lt;p&gt;&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;br&gt;
&lt;thead&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;/thead&gt;
&lt;br&gt;
&lt;tbody&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Serial&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;bswap/rev&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;SSSE3/Neon&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;AVX2&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;AVX512&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Pattern&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;&lt;a rel="noopener noreferrer" href="https://github.com/Wunkolo/qreverseimages/Serial.gif"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FWunkolo%2Fqreverseimages%2FSerial.gif" alt="Serial"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;&lt;a rel="noopener noreferrer" href="https://github.com/Wunkolo/qreverseimages/Swap64.gif"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FWunkolo%2Fqreverseimages%2FSwap64.gif" alt="bswap/rev"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;&lt;a rel="noopener noreferrer" href="https://github.com/Wunkolo/qreverseimages/SSSE3.gif"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FWunkolo%2Fqreverseimages%2FSSSE3.gif" alt="SSSE3"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;&lt;a rel="noopener noreferrer" href="https://github.com/Wunkolo/qreverseimages/AVX2.gif"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FWunkolo%2Fqreverseimages%2FAVX2.gif" alt="AVX2"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;&lt;a rel="noopener noreferrer" href="https://github.com/Wunkolo/qreverseimages/AVX512.gif"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FWunkolo%2Fqreverseimages%2FAVX512.gif" alt="AVX512"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Processor&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;Speedup&lt;/td&gt;


&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;a href="https://en.wikichip.org/wiki/intel/core_i9/i9-7900x" rel="nofollow noopener noreferrer"&gt;i9-7900x&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x1&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x15.386&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x10.417&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x22.032&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x22.357&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;a href="https://en.wikichip.org/wiki/intel/core_i3/i3-6100" rel="nofollow noopener noreferrer"&gt;i3-6100&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x1&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x15.8&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x10.5&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x16.053&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;a href="https://en.wikichip.org/wiki/intel/core_i5/i5-8600k" rel="nofollow noopener noreferrer"&gt;i5-8600K&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x1&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x15.905&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x10.21&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x16.076&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;a href="https://en.wikichip.org/wiki/intel/core_i5/i5-8600k" rel="nofollow noopener noreferrer"&gt;E5-2697 v4&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x1&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x16.701&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x15.716&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x19.141&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;a href="https://en.wikipedia.org/wiki/Broadcom_Corporation#Raspberry_Pi" rel="nofollow noopener noreferrer"&gt;BCM2837&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x1&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x7.391&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x7.718&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;/tbody&gt;
&lt;br&gt;
&lt;/table&gt;&lt;/div&gt;&lt;br&gt;
&lt;/p&gt;


&lt;p&gt;Array reversal implementations typically involve swapping both ends of the array and working down to the middle-most elements. C++ being type-aware treats array elements as objects and will call overloaded class operators such as &lt;code&gt;operator=&lt;/code&gt; or a &lt;code&gt;copy by reference&lt;/code&gt; constructor where available. Many implementations of a "swap" function would use an intermediate temporary variable to make the exchange which would require a minimum of two calls to an object's &lt;code&gt;operator=&lt;/code&gt; and at least one call to an object's &lt;code&gt;copy by reference&lt;/code&gt; constructor. Some other novel algorithms use the…&lt;/p&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Wunkolo/qreverse" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


</description>
      <category>cpp</category>
      <category>c</category>
      <category>performance</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Average color of an image, vectorized</title>
      <dc:creator>Wunk</dc:creator>
      <pubDate>Sun, 24 Feb 2019 07:34:00 +0000</pubDate>
      <link>https://dev.to/wunk/average-color-of-an-image-vectorized-4802</link>
      <guid>https://dev.to/wunk/average-color-of-an-image-vectorized-4802</guid>
      <description>&lt;p&gt;This is a little snippet write-up of code that will find the average color of an image of RGBA8 pixels (32-bits per pixel, 8 bits per channel) by utilizing the &lt;code&gt;psadbw&lt;/code&gt;(&lt;code&gt;_mm_sad_epu8&lt;/code&gt;) instruction to accumulate the sum of each individual channel into a (very overflow-safe)64-bit accumulator.&lt;/p&gt;

&lt;p&gt;Inspired by the &lt;a href="http://0x80.pl/notesen/2018-10-24-sse-sumbytes.html" rel="noopener noreferrer"&gt;"SIMDized sum of all bytes in the array" write-up&lt;/a&gt; by &lt;a href="https://twitter.com/pshufb" rel="noopener noreferrer"&gt;Wojciech Muła&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The usual method to get the statistical average of each color channel in an image is pretty trivial:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load in a pixel&lt;/li&gt;
&lt;li&gt;Unpack the individual color channels&lt;/li&gt;
&lt;li&gt;Sum the channel values into a an accumulator&lt;/li&gt;
&lt;li&gt;When you've summed them all, divide these sums by the number of total pixels&lt;/li&gt;
&lt;li&gt;Interleave these averages into a new color value&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fl8hyfkk4j3gymb4l23t6.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fl8hyfkk4j3gymb4l23t6.gif"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="nf"&gt;AverageColorRGBA8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;Pixels&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;RedSum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GreenSum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BlueSum&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AlphaSum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;RedSum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GreenSum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;BlueSum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AlphaSum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;CurColor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pixels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="n"&gt;AlphaSum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;CurColor&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;BlueSum&lt;/span&gt;  &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;CurColor&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;GreenSum&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;CurColor&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;  &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="n"&gt;RedSum&lt;/span&gt;   &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;CurColor&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;RedSum&lt;/span&gt;   &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;GreenSum&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;BlueSum&lt;/span&gt;  &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="n"&gt;AlphaSum&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;AlphaSum&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;BlueSum&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;GreenSum&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;  &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
        &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="n"&gt;RedSum&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is a pretty serial way to do it. Pick up a pixel, unpack it, add it to a sum.&lt;/p&gt;

&lt;p&gt;Each of these unpacks and sums are pretty independent of each other can be parallelized with some SIMD trickery to do these unpacks and sums in chunks of 4, 8, even &lt;strong&gt;16&lt;/strong&gt; pixels at once in parallel.&lt;/p&gt;

&lt;p&gt;There is no dedicated instruction for a horizontal sum of 8-bit elements within a vector register in any of the &lt;a href="https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions#Later_versions" rel="noopener noreferrer"&gt;x86 SIMD variations&lt;/a&gt;.&lt;br&gt;
The closest tautology is an instruction that gets the &lt;strong&gt;S&lt;/strong&gt;um of &lt;strong&gt;A&lt;/strong&gt;bsolute &lt;strong&gt;D&lt;/strong&gt;ifferences of &lt;strong&gt;8&lt;/strong&gt;-bit elements within 64-bit lanes, and then horizontally adds these 8-bit differences into the lower 16-bits of the 64-bit lane. This is basically computing the &lt;a href="https://en.wikipedia.org/wiki/Taxicab_geometry" rel="noopener noreferrer"&gt;manhatten distance&lt;/a&gt; between two vectors of eight 8-bit elements.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AD(A,B) = ABS(A - B) # Absolute difference

X = ( 1, 2, 3, 4, 5, 6, 7, 8) # 8 byte vectors
Y = ( 8, 9,10,11,12,13,14,15)

# Sum of absolute differences
SAD(X,Y) =
    # Absolute difference of each of the pairs of 8-bit elements
    AD(X[0],Y[0]) + AD(X[1],Y[1]) + AD(X[2],Y[2]) + AD(X[3],Y[3]) +
    AD(X[4],Y[4]) + AD(X[5],Y[5]) + AD(X[6],Y[6]) + AD(X[7],Y[7])
    =
    ABS(  1 -  8 ) + ABS(  2 -  9 ) + ABS(  3 - 10 ) + ABS(  4 - 11 ) +
    ABS(  5 - 12 ) + ABS(  6 - 13 ) + ABS(  7 - 14 ) + ABS(  8 - 15 )
    =
    ABS( -7 ) + ABS( -7 ) + ABS( -7 ) + ABS( -7 ) +
    ABS( -7 ) + ABS( -7 ) + ABS( -7 ) + ABS( -7 )
    # Horizontally sum all these differences into a 16-bit value
    = 7 + 7 + 7 + 7 + 7 + 7 + 7 + 7
    = 56
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;code&gt;psadbw&lt;/code&gt; may seem like a pretty niche instruction at first. You're probably wondering why such a specific series of operations is implemented as an official x86 instruction but it has had plenty of usage since the original SSE days to aid in block-based &lt;a href="https://en.wikipedia.org/wiki/Sum_of_absolute_differences" rel="noopener noreferrer"&gt;motion estimation&lt;/a&gt; for video encoding.&lt;br&gt;
The trick here is recognizing that the absolute difference between an &lt;em&gt;unsigned&lt;/em&gt; number and &lt;em&gt;zero&lt;/em&gt;, is just the unsigned number again. The &lt;em&gt;sum&lt;/em&gt; of the absolute difference between a vector of unsigned values and vector-0 is a way to extract just the horizontal-addition step of SAD for this particular use.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(A is unsigned)
AD(A,B) = ABS(A - 0) = A

X = ( 1, 2, 3, 4, 5, 6, 7, 8) # 8 byte vectors
Y = ( 0, 0, 0, 0, 0, 0, 0, 0)

SAD(X,Y) =
    AD(X[0],0) + AD(X[1],0) + AD(X[2],0) + AD(X[3],0) +
    AD(X[4],0) + AD(X[5],0) + AD(X[6],0) + AD(X[7],0)
    =
    X[0] + X[1] + X[2] + X[3] + X[4] + X[5] + X[6] + X[7]
    =
    1 + 2 + 3 + 4 + 5 + 6 + 7
    = 28
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This kind of utilizaton of &lt;code&gt;psadbw&lt;/code&gt; will allow a vector of 8 consecutive bytes to be horizontally summed into the low 16-bits of a 64-bit lane, and this 16-bit value can then be directly added to a larger 64-bit accumulator. With this, a chunk of RGBA color values can be loaded into a vector, unpacked so that all their R,G,B,A bytes are grouped into 64-bit lanes, &lt;code&gt;psadbw&lt;/code&gt; these 64-bit lanes to get 16-bit sums, and then accumulate these sums into a 64-bit accumulator to later get their average.&lt;/p&gt;

&lt;p&gt;Usually, taking the average of a large amount of values can cause some worry for overflow. With instructions like &lt;code&gt;psadbw&lt;/code&gt; that operate on 64-bit lanes, it lends itself to the usage of 64-bit accumulators which are very resistant to overflow.&lt;br&gt;
An individual channel would need &lt;code&gt;2^64 / 2^8 == 72057594037927936&lt;/code&gt; pixels (almost 69 billion megapixels) with a value of &lt;code&gt;0xFF&lt;/code&gt; for that color channel to overflow its 64-bit accumulator.&lt;br&gt;
Pretty resistant I'd say.&lt;/p&gt;

&lt;p&gt;An SSE vector-register is 128 bits, it will be able to hold two 64-bit accumulators per vector-register so one SSE register can be used to accumulate the sum of &lt;code&gt;|Green|Red|&lt;/code&gt; values and another vector-register for the &lt;code&gt;|Alpha|Blue|&lt;/code&gt; sums.&lt;/p&gt;

&lt;p&gt;The main loop would look something like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Load in a chunk of 4 32-bit pixels into a 128-bit register

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;|ABGR|ABGR|ABGR|ABGR|&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Shuffle the 8-bit channels within the vector so that the upper 64-bits of the register has one channel, and the lower 64-bits has another.

&lt;ul&gt;
&lt;li&gt;There is a bit of "waste" as you only have four bytes of a particular channel and 8-bytes within a lane. These values can be set to zero by passing a shuffle-index with the upper bit set when using &lt;code&gt;_mm_shuffle_epi8&lt;/code&gt;(A value such as &lt;code&gt;-1&lt;/code&gt; will get &lt;code&gt;pshufb&lt;/code&gt; to write &lt;code&gt;0&lt;/code&gt;). This way these bytes will not effect the sum.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;|0G0G0G0G|0R0R0R0R|&lt;/code&gt; or &lt;code&gt;|0A0A0A0A|0B0B0B0B|&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;|0000GGGG|0000RRRR|&lt;/code&gt; or &lt;code&gt;|0000AAAA|0000BBBB|&lt;/code&gt; works too&lt;/li&gt;
&lt;li&gt;Any permutation in particular works so long as the unused elements do not effect the horizontal sum and are 0&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;_mm_sad_epu&lt;/code&gt; the vector, getting two 16-bit sums within each of the 64-bit lanes, add this to the 64-bit accumulators

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;|0000ΣG16|0000ΣR16|&lt;/code&gt; or &lt;code&gt;|0000ΣA16|0000ΣB16|&lt;/code&gt; 16-bit sums, within the upper and lower 64-bit halfs of the 128-bit register&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A &lt;code&gt;psadbw&lt;/code&gt;-accelerated pixel-summing loop that handles four pixels at a time would look something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0fwkl4qvjd7mmztew66v.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F0fwkl4qvjd7mmztew66v.gif"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// | 64-bit Red Sum | 64-bit Green Sum |&lt;/span&gt;
&lt;span class="n"&gt;__m128i&lt;/span&gt; &lt;span class="n"&gt;RedGreenSum64&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_setzero_si128&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="c1"&gt;// | 64-bit Blue Sum | 64-bit Alpha Sum |&lt;/span&gt;
&lt;span class="n"&gt;__m128i&lt;/span&gt; &lt;span class="n"&gt;BlueAlphaSum64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_setzero_si128&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="c1"&gt;// 4 pixels at a time! (SSE)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__m128i&lt;/span&gt; &lt;span class="n"&gt;QuadPixel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_stream_load_si128&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;__m128i&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Pixels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
    &lt;span class="n"&gt;RedGreenSum64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_add_epi64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="c1"&gt;// Add it to the 64-bit accumulators&lt;/span&gt;
        &lt;span class="n"&gt;RedGreenSum64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;_mm_sad_epu8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="c1"&gt;// compute | 0+G+0+G+0+G+0+G | 0+R+0+R+0+R+0+R |&lt;/span&gt;
            &lt;span class="n"&gt;_mm_shuffle_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="c1"&gt;// Shuffle the bytes to | 0G0G0G0G | 0R0R0R0R |&lt;/span&gt;
                &lt;span class="n"&gt;QuadPixel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;_mm_set_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="c1"&gt;// Green&lt;/span&gt;
                    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="c1"&gt;// Red&lt;/span&gt;
                    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="c1"&gt;// SAD against 0, which just returns the original unsigned value&lt;/span&gt;
            &lt;span class="n"&gt;_mm_setzero_si128&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;BlueAlphaSum64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_add_epi64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="c1"&gt;// Add it to the 64-bit accumulators&lt;/span&gt;
        &lt;span class="n"&gt;BlueAlphaSum64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;_mm_sad_epu8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="c1"&gt;// compute | 0+A+0+A+0+A+0+A | 0+B+0+B+0+B+0+B |&lt;/span&gt;
            &lt;span class="n"&gt;_mm_shuffle_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="c1"&gt;// Shuffle the bytes to | 0A0A0A0A | 0B0B0B0B |&lt;/span&gt;
                &lt;span class="n"&gt;QuadPixel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="n"&gt;_mm_set_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="c1"&gt;// Alpha&lt;/span&gt;
                    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="c1"&gt;// Blue&lt;/span&gt;
                    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="c1"&gt;// SAD against 0, which just returns the original unsigned value&lt;/span&gt;
            &lt;span class="n"&gt;_mm_setzero_si128&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;After doing chunks of 4 at a time, it can handle the unaligned pixels(there will only ever be 3 or less left-over) by extracting the 64-bit accumulators from the vector-registers, and falling back to the usual serial method.&lt;br&gt;
Though there are some slight optimizations that can be done here too. &lt;code&gt;bextr&lt;/code&gt; can extract continuous bits a little quicker without doing a shift-and-a-mask to get the upper color channels. x86 has &lt;a href="http://flint.cs.yale.edu/cs421/papers/x86-asm/x86-registers.png" rel="noopener noreferrer"&gt;register aliasing for the lower two bytes of its general purpose registers&lt;/a&gt; though so a &lt;code&gt;bextr&lt;/code&gt; would probably be overhandling for the lower color channels in the lower bytes.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Extract the 64-bit accumulators from the vector registers of the previous loop&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;RedSum64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_cvtsi128_si64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RedGreenSum64&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;GreenSum64&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_extract_epi64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RedGreenSum64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;BlueSum64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_cvtsi128_si64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BlueAlphaSum64&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint64_t&lt;/span&gt; &lt;span class="n"&gt;AlphaSum64&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm_extract_epi64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BlueAlphaSum64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// New serial method&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt; &lt;span class="n"&gt;CurColor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pixels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="n"&gt;AlphaSum64&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;_bextr_u64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;CurColor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;BlueSum64&lt;/span&gt;  &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;_bextr_u64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;CurColor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// I'm being oddly specific here to make it obvious for the&lt;/span&gt;
    &lt;span class="c1"&gt;// compiler to do some ah/bh/ch/dh register trickery&lt;/span&gt;
    &lt;span class="c1"&gt;//                                              V&lt;/span&gt;
    &lt;span class="n"&gt;GreenSum64&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;CurColor&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;  &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;RedSum64&lt;/span&gt;   &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;CurColor&lt;/span&gt;       &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// Average&lt;/span&gt;
&lt;span class="n"&gt;RedSum64&lt;/span&gt;   &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;GreenSum64&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;BlueSum64&lt;/span&gt;  &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="n"&gt;AlphaSum64&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Interleave&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;AlphaSum64&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;BlueSum64&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;GreenSum64&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;  &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;static_cast&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint32_t&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;uint8_t&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="n"&gt;RedSum64&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;  &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This implementation so far with a 3840x2160 image on an &lt;a href="https://en.wikichip.org/wiki/intel/core_i7/i7-7500u" rel="noopener noreferrer"&gt;i7-7500U&lt;/a&gt; shows an approximate &lt;strong&gt;x2.6&lt;/strong&gt; increase in performance over the serial method. It now takes less than half the time to process an image now.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Serial: #10121AFF |      7100551ns
Fast  : #10121AFF |      2701641ns
Speedup: 2.628236
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;With &lt;a href="https://en.wikipedia.org/wiki/Advanced_Vector_Extensions" rel="noopener noreferrer"&gt;AVX2&lt;/a&gt; this algorithm can be updated to process &lt;strong&gt;8&lt;/strong&gt; pixels a time, then &lt;strong&gt;4&lt;/strong&gt; pixels a time before resorting to the serial algorithm for unaligned data(there will only be 7 or less unaligned pixels, think &lt;a href="https://en.wikipedia.org/wiki/Greedy_algorithm" rel="noopener noreferrer"&gt;greedy algorithms&lt;/a&gt;). With much larger &lt;strong&gt;256-bit&lt;/strong&gt; vectors, all four 64-bit accumulators can reside within a single AVX2 register.&lt;br&gt;
Though, AVX2's massive 256-bit vector-registers is almost just an alias for two regular 128-bit SSE registers from before with the additional benefit of being able to compactly handle two 128-bit registers with 1 instruction.&lt;br&gt;
This also means that cross-lane arithmetic(shuffling elements across the full width of a 256-bit register, rather than staying within the upper and lower 128-bit halfs) can be tricky as crossing the 128-bit boundary needs some special attention. A solution to this is to shuffle first within the upper and lower 128-bit lanes, and then using a much larger cross-lane shuffle to further unpack the channels into continuous values before computing a &lt;code&gt;_mm256_sad_epu8&lt;/code&gt; on each of the four 64-bit lanes.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Vector of four 64-bit accumulators for Red,Green,Blue, and Alpha&lt;/span&gt;
&lt;span class="n"&gt;__m256i&lt;/span&gt; &lt;span class="n"&gt;RGBASum64&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_setzero_si256&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="c1"&gt;// 8 pixels at a time! (AVX/AVX2)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;__m256i&lt;/span&gt; &lt;span class="n"&gt;OctaPixel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_loadu_si256&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;__m256i&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Pixels&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;
    &lt;span class="c1"&gt;// Shuffle within 128-bit lanes&lt;/span&gt;
    &lt;span class="c1"&gt;// | ABGRABGRABGRABGR | ABGRABGRABGRABGR |&lt;/span&gt;
    &lt;span class="c1"&gt;// | AAAABBBBGGGGRRRR | AAAABBBBGGGGRRRR |&lt;/span&gt;
    &lt;span class="c1"&gt;// Setting up for 64-bit lane sad_epu8&lt;/span&gt;
    &lt;span class="n"&gt;__m256i&lt;/span&gt; &lt;span class="n"&gt;Deinterleave&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_shuffle_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;OctaPixel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;_mm256_broadcastsi128_si256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;_mm_set_epi8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="c1"&gt;// Alpha&lt;/span&gt;
                &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="c1"&gt;// Blue&lt;/span&gt;
                &lt;span class="mi"&gt;14&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="c1"&gt;// Green&lt;/span&gt;
                &lt;span class="mi"&gt;13&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="c1"&gt;// Red&lt;/span&gt;
                &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Cross-lane shuffle&lt;/span&gt;
    &lt;span class="c1"&gt;// | AAAABBBBGGGGRRRR | AAAABBBBGGGGRRRR |&lt;/span&gt;
    &lt;span class="c1"&gt;// | AAAAAAAA | BBBBBBBB | GGGGGGGG | RRRRRRRR |&lt;/span&gt;
    &lt;span class="n"&gt;Deinterleave&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_permutevar8x32_epi32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;Deinterleave&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;_mm256_set_epi32&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="c1"&gt;// Alpha&lt;/span&gt;
            &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;// Blue&lt;/span&gt;
            &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;// Green&lt;/span&gt;
            &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;// Red&lt;/span&gt;
            &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// | ASum64 | BSum64 | GSum64 | RSum64 |&lt;/span&gt;
    &lt;span class="n"&gt;RGBASum64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_add_epi64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;RGBASum64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;_mm256_sad_epu8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;Deinterleave&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;_mm256_setzero_si256&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Pass the accumulators onto the next SSE loop from above&lt;/span&gt;
&lt;span class="n"&gt;__m128i&lt;/span&gt; &lt;span class="n"&gt;BlueAlphaSum64&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_extractf128_si256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RGBASum64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;__m128i&lt;/span&gt; &lt;span class="n"&gt;RedGreenSum64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_mm256_castsi256_si128&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RGBASum64&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="kt"&gt;size_t&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;Count&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
&lt;span class="p"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This implementation so far(AVX2,SSE,and Serial) with the same 3840x2160 image as before on an &lt;a href="https://en.wikichip.org/wiki/intel/core_i7/i7-7500u" rel="noopener noreferrer"&gt;i7-7500U&lt;/a&gt; shows an approximate &lt;strong&gt;x4.1&lt;/strong&gt; increase in performance over the serial method. It now takes less than a &lt;strong&gt;forth&lt;/strong&gt; of the time to calculate the color average over the serial version!&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Serial: #10121AFF |      7436508ns
Fast  : #10121AFF |      1802768ns
Speedup: 4.125050
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  RGB8? RG8? R8?
&lt;/h3&gt;

&lt;p&gt;Not explored in this little write-up are other pixel formats such as the three-channel &lt;code&gt;RGB&lt;/code&gt; with no alpha or just a 2-channel &lt;code&gt;RG&lt;/code&gt; image or even just a monochrome "R8" image of just one channel.&lt;br&gt;
Other formats will work with the same principle as the &lt;code&gt;RGBA&lt;/code&gt; format one so long as you account for the proper shuffling needed.&lt;/p&gt;

&lt;p&gt;Depending on where you are at in your pixel processing, consider the different cases of how you can take a 16-byte chunk out of a stream of 3-byte &lt;code&gt;RGB&lt;/code&gt; pixels&lt;br&gt;
If your bytes were organized &lt;code&gt;|RGB|RGB|RGB|RGB|...&lt;/code&gt; and a vector-register with a width of 16-bytes, then you'll always be taking &lt;code&gt;16/3 = 15.3333...&lt;/code&gt; &lt;code&gt;RGB&lt;/code&gt; pixels at once. And every &lt;strong&gt;3&lt;/strong&gt;rd chunk (&lt;code&gt;0.333.. * 3 = 1.0&lt;/code&gt;) is when the period repeats. So only 3 shuffle-masks are ever needed depending on where your index &lt;code&gt;i&lt;/code&gt; lands on the array. Some ASCII art might help visualize it&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Note how different the bytes align when taking regular 16-byte chunks out of a
stream of 3-byte pixels.
Different shuffle patterns must be used to account for each case:

 &amp;gt;RGBRGBRGBRGBRGBRG&amp;lt;
0:|---SIMD Reg.---|
  RGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGB...
                  &amp;gt;BRGBRGBRGBRGBRGBR&amp;lt;
1:                 |---SIMD Reg.---|
  RGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGB...
                                   &amp;gt;GBRGBRGBRGBRGBRGB&amp;lt;
2:                                  |---SIMD Reg.---|
  RGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGB...
                   This is the same as iteration 0  &amp;gt;RGBRGBRGBRGBRGBRG&amp;lt;
3:                                                   |---SIMD Reg.---|
  RGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGB...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;You can also process much larger &lt;strong&gt;48&lt;/strong&gt;-byte (&lt;code&gt;lcd(16,3)&lt;/code&gt;) aligned chunks to have more productive iterations at a higher granularity:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Now you don't have to worry about byte-level alignment and can do all the shuffles at once
 &amp;gt;RGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGB&amp;lt;
0:|---SIMD Reg.---||---SIMD Reg.---||---SIMD Reg.---|
  RGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGBRGB...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Once you got your sums, then it's just a division and interleave to turn these statistical averages into a new color value.&lt;/p&gt;

&lt;p&gt;For RG8, the same principle applies but much more trivial since 2-byte pixels naturally align themselves with power-of-two register widths.&lt;/p&gt;

&lt;p&gt;For R8, the summing step reduces to just be a sum-of-bytes which is a topic precisely &lt;a href="http://0x80.pl/notesen/2018-10-24-sse-sumbytes.html" rel="noopener noreferrer"&gt;covered by Wojciech Muła&lt;/a&gt;. After getting the sum, divide by the number of pixels to get the statistical average.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Wunkolo" rel="noopener noreferrer"&gt;
        Wunkolo
      &lt;/a&gt; / &lt;a href="https://github.com/Wunkolo/qAverageColor" rel="noopener noreferrer"&gt;
        qAverageColor
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      SIMD accelerated method to get the average color of an RGBA8 image
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;qAverageColor &lt;a href="https://github.com/Wunkolo/qAverageColorLICENSE" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/6581c31c16c1b13ddc2efb92e2ad69a93ddc4a92fd871ff15d401c4c6c9155a4/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f6c6963656e73652d4d49542d626c75652e737667" alt="GitHub license"&gt;&lt;/a&gt;
&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;br&gt;
&lt;thead&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;/thead&gt;
&lt;br&gt;
&lt;tbody&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Serial&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;SSE4.2&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;AVX2&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;AVX512&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;


&lt;td&gt;&lt;a rel="noopener noreferrer" href="https://github.com/Wunkolo/qAverageColormedia/Pat-Serial.gif"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FWunkolo%2FqAverageColormedia%2FPat-Serial.gif" alt="Serial"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;&lt;a rel="noopener noreferrer" href="https://github.com/Wunkolo/qAverageColormedia/Pat-SSE.gif"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FWunkolo%2FqAverageColormedia%2FPat-SSE.gif" alt="SSE"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;&lt;a rel="noopener noreferrer" href="https://github.com/Wunkolo/qAverageColormedia/Pat-AVX2.gif"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FWunkolo%2FqAverageColormedia%2FPat-AVX2.gif" alt="AVX2"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;&lt;a rel="noopener noreferrer" href="https://github.com/Wunkolo/qAverageColormedia/Pat-AVX512.gif"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FWunkolo%2FqAverageColormedia%2FPat-AVX512.gif" alt="AVX512"&gt;&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;Processor&lt;/td&gt;


&lt;td&gt;Speedup&lt;/td&gt;


&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;a href="https://en.wikichip.org/wiki/intel/core_i7/i7-7500u" rel="nofollow noopener noreferrer"&gt;i7-7500u&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x2.8451&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x4.4087&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;&lt;em&gt;N/A&lt;/em&gt;&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;a href="https://en.wikichip.org/wiki/intel/core_i3/i3-6100" rel="nofollow noopener noreferrer"&gt;i3-6100&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x2.7258&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x4.2358&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;&lt;em&gt;N/A&lt;/em&gt;&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;a href="https://en.wikichip.org/wiki/intel/core_i5/i5-8600k" rel="nofollow noopener noreferrer"&gt;i5-8600k&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x2.4015&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x2.6498&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;&lt;em&gt;N/A&lt;/em&gt;&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;a href="https://en.wikichip.org/wiki/intel/core_i9/i9-7900x" rel="nofollow noopener noreferrer"&gt;i9-7900x&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x2.0651&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x2.6140&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x4.2704&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;a href="https://en.wikichip.org/wiki/intel/core_i7/i7-1065g7" rel="nofollow noopener noreferrer"&gt;i7-1065G7&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x3.9124&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x4.6244&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x5.4683&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;tr&gt;
&lt;br&gt;
&lt;td&gt;&lt;a href="https://ark.intel.com/content/www/us/en/ark/products/212325/intel-core-i9-11900k-processor-16m-cache-up-to-5-30-ghz.html" rel="nofollow noopener noreferrer"&gt;i9-11900k&lt;/a&gt;&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x4.2406&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x5.3535&lt;/td&gt;
&lt;br&gt;
&lt;td&gt;x6.0925&lt;/td&gt;
&lt;br&gt;
&lt;/tr&gt;
&lt;br&gt;
&lt;/tbody&gt;
&lt;br&gt;
&lt;/table&gt;&lt;/div&gt;&lt;/p&gt;

&lt;p&gt;&lt;sup&gt;Tested against a synthetic 10-megapixel image, GCC version 8.2.1&lt;/sup&gt;&lt;/p&gt;

&lt;p&gt;This is a little snippet write-up of code that will find the average color of an image of RGBA8 pixels (32-bits per pixel, 8 bits per channel) by utilizing the &lt;code&gt;psadbw&lt;/code&gt;(&lt;code&gt;_mm_sad_epu8&lt;/code&gt;) instruction to accumulate the sum of each individual channel into a (very overflow-safe)64-bit accumulator.&lt;/p&gt;

&lt;p&gt;Inspired by the &lt;a href="http://0x80.pl/notesen/2018-10-24-sse-sumbytes.html" rel="nofollow noopener noreferrer"&gt;"SIMDized sum of all bytes in the array" write-up&lt;/a&gt; by &lt;a href="https://twitter.com/pshufb" rel="nofollow noopener noreferrer"&gt;Wojciech Muła&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The usual method to get the statistical average of each color channel in an image is pretty trivial:&lt;/p&gt;


&lt;ol&gt;

&lt;li&gt;Load in a pixel&lt;/li&gt;

&lt;li&gt;Unpack the individual color channels&lt;/li&gt;

&lt;li&gt;Sum the channel values into a an accumulator&lt;/li&gt;

&lt;li&gt;When you've summed them all…&lt;/li&gt;

&lt;/ol&gt;
&lt;/div&gt;
&lt;br&gt;
  &lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Wunkolo/qAverageColor" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


</description>
      <category>cpp</category>
      <category>c</category>
      <category>performance</category>
      <category>algorithms</category>
    </item>
  </channel>
</rss>
