<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Levente Kurusa</title>
    <description>The latest articles on DEV Community by Levente Kurusa (@ilevex).</description>
    <link>https://dev.to/ilevex</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F63947%2F9e57c70f-6df8-4354-b216-cceff4ed8069.png</url>
      <title>DEV Community: Levente Kurusa</title>
      <link>https://dev.to/ilevex</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ilevex"/>
    <language>en</language>
    <item>
      <title>x86 ISA Extensions part II: SSE</title>
      <dc:creator>Levente Kurusa</dc:creator>
      <pubDate>Sun, 15 Apr 2018 15:38:00 +0000</pubDate>
      <link>https://dev.to/ilevex/x86-isa-extensions-part-ii-sse-jpa</link>
      <guid>https://dev.to/ilevex/x86-isa-extensions-part-ii-sse-jpa</guid>
      <description>&lt;p&gt;Welcome back to this series exploring the many extensions the x86 architecture has seen over the past decades. In this installment, we will be looking at the successor to &lt;a href="https://dev.to/ilevex/x86-isa-extensions-part-i-mmx-1hjf"&gt;MMX&lt;/a&gt;: Streaming SIMD Extensions, or SSE for short. Most of these instructions are SIMD (as the name implies), which stands for &lt;strong&gt;S&lt;/strong&gt;ingle &lt;strong&gt;I&lt;/strong&gt;nstruction &lt;strong&gt;M&lt;/strong&gt;ultiple &lt;strong&gt;D&lt;/strong&gt;ata. In brief, SIMD instructions are similar to the ones we’ve covered in the MMX article: a single instruction can operate on several data elements at once.&lt;/p&gt;

&lt;p&gt;SSE was introduced in 1999 with Intel’s Pentium III, soon after Intel saw AMD’s “3DNow!” extension (we will cover this extension in a future installment, but right now I lack access to an AMD machine that I could use 🙂). A question arises naturally: SSE wasn’t the first SIMD instruction set that Intel introduced to the x86 family of processors, so why did Intel create a new extension set? Unfortunately, MMX had two major problems at the time. First, the registers it “introduced” were aliases of previously existing registers (amusingly, this was touted as an advantage for a while because it made context switching easier), which meant that floating-point and MMX operations couldn’t coexist. Second, MMX only worked on integers; it had no support for floating-point numbers, which were becoming increasingly important in 3D computer graphics. SSE adds dozens of new instructions that operate on an independent register set, plus a few integer instructions that continue to operate on the old MMX registers.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(A slight note before we start: in this article, “SSE” refers to the very first SSE extension introduced by Intel. In future installments of this series, we will explore SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2, but here we focus on “SSE1”.)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Do you have SSE?
&lt;/h2&gt;

&lt;p&gt;As with all instruction set extensions, there is a chance that your CPU does not support it. The chances are once again pretty slim with SSE, given its age, but it’s always interesting to see how one can verify that a CPU supports SSE.&lt;/p&gt;

&lt;p&gt;On Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/cpuinfo | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-wq&lt;/span&gt; sse &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"SSE available"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"SSE not available"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
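&lt;p&gt;For readers who would rather script the check, here is a minimal Python sketch of roughly what the &lt;code&gt;grep -w&lt;/code&gt; pipeline above does: it looks for a whole-word match among the entries of a &lt;code&gt;flags&lt;/code&gt; line. The helper name &lt;code&gt;has_cpu_flag&lt;/code&gt; and the sample text are assumptions made for illustration; they are not part of any library.&lt;/p&gt;

```python
def has_cpu_flag(cpuinfo_text, flag):
    """Return True if `flag` appears as a whole word on a 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags" and flag in value.split():
            return True
    return False


# Hypothetical excerpt; on a real Linux system the text would come from
# open("/proc/cpuinfo").read().
SAMPLE = "processor\t: 0\nflags\t\t: fpu vme mmx fxsr sse sse2 ssse3\n"

print(has_cpu_flag(SAMPLE, "sse"))   # whole-word match on the flags line
print(has_cpu_flag(SAMPLE, "sse4"))  # "sse4" is not listed in the sample
```

&lt;p&gt;Splitting on whitespace is what keeps &lt;code&gt;sse&lt;/code&gt; from matching inside &lt;code&gt;ssse3&lt;/code&gt;, which is the job &lt;code&gt;-w&lt;/code&gt; does for &lt;code&gt;grep&lt;/code&gt;.&lt;/p&gt;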



&lt;p&gt;On OS X/macOS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;sysctl machdep.cpu.features | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-wq&lt;/span&gt; SSE &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"SSE available"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"SSE not available"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, CPUID offers a way to gather this information on bare-metal or in an OS-agnostic way. SSE is indicated by CPUID leaf 1, EDX bit 25:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl _is_sse_available
_is_sse_available:
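    pushq   %rbx            # cpuid clobbers %rbx, which is callee-saved

    movq    $1, %rax        # select CPUID leaf 1 (feature information)
    cpuid
    movq    %rdx, %rax      # the feature flags are returned in %edx
    shrq    $25, %rax       # move bit 25 (SSE) down to bit 0
    andq    $1, %rax        # keep only that bit as the return value

    popq    %rbx
    ret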
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
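&lt;p&gt;The shift-and-mask step in the routine above can be modelled in a few lines of Python. This is only a sketch: &lt;code&gt;edx_feature_bit&lt;/code&gt; is a made-up helper name, the EDX values are invented to exercise it, and integer division and modulo stand in for the shift and mask (they are equivalent for non-negative integers).&lt;/p&gt;

```python
def edx_feature_bit(edx, bit):
    """Return 1 if `bit` is set in the 32-bit EDX value, else 0."""
    # Same effect as the shrq/andq pair in the assembly routine.
    return (edx // 2**bit) % 2


SSE_BIT = 25

# Hypothetical EDX values, chosen only to exercise the helper.
edx_with_sse = 2**25 + 2**23
edx_without_sse = 2**23

print(edx_feature_bit(edx_with_sse, SSE_BIT))
print(edx_feature_bit(edx_without_sse, SSE_BIT))
```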



&lt;p&gt;Once you are satisfied that your CPU allows for SSE instructions, it’s time to dive into the specifics of SSE!&lt;/p&gt;

&lt;h2&gt;
  
  
  Registers
&lt;/h2&gt;

&lt;p&gt;Since SSE introduces actual, new registers (in contrast with its predecessor), I think it’s useful to have a quick glance at them. SSE added eight 128-bit registers, named &lt;code&gt;%xmm0, %xmm1, ..., %xmm7&lt;/code&gt;. (Amusingly, &lt;code&gt;xmm&lt;/code&gt; is the reverse of &lt;code&gt;mmx&lt;/code&gt;, the name of the older extension; I assume this is meant as a pun, but I couldn’t find a source confirming it.) In stark contrast with MMX, SSE does not allow for multiple data types. Each XMM register holds four 32-bit &lt;a href="https://en.m.wikipedia.org/wiki/Single-precision_floating-point_format" rel="noopener noreferrer"&gt;single-precision floating-point values&lt;/a&gt;, while an MMX register could hold integers of different widths.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%xmm0, %xmm1, ..., %xmm7:
*- - - - - - - - -*- - - - - - - - -*- - - - - - - - -*- - - - - - - - -*
| 32-bit SP float | 32-bit SP float | 32-bit SP float | 32-bit SP float |
*- - - - - - - - -*- - - - - - - - -*- - - - - - - - -*- - - - - - - - -*
|                            128-bit value                              |
*- - - - - - - - -*- - - - - - - - -*- - - - - - - - -*- - - - - - - - -*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this figure, each row represents a data type that an XMM register can hold with SSE. I’ve included the “128-bit value” row because, if you only load data into the register and do not issue any floating-point operation, the register can hold any unstructured data. Floating-point operations, however, only support four single-precision values as the contents of the register. Arbitrary bit patterns interpreted this way may turn out to be NaNs or denormals, which can cause floating-point exceptions.&lt;/p&gt;
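&lt;p&gt;To make the layout concrete, here is a small Python sketch that packs four single-precision floats into 16 bytes, the same amount of data one XMM register holds. It uses the standard &lt;code&gt;struct&lt;/code&gt; module; the sample values are arbitrary, and the sketch only models the memory image, not the register itself.&lt;/p&gt;

```python
import struct

floats = (3.14, 2.71, 1.23, 4.56)

# "=4f" packs four 32-bit IEEE 754 single-precision floats with no padding.
raw = struct.pack("=4f", *floats)
print(len(raw) * 8)  # width of the packed value in bits

# Reading the bytes back yields the values rounded to single precision.
lanes = struct.unpack("=4f", raw)
print(lanes)
```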

&lt;p&gt;To control the state of some operations, an additional control and status register is added, dubbed &lt;code&gt;MXCSR&lt;/code&gt;. This register cannot be accessed using the &lt;code&gt;mov&lt;/code&gt; family of instructions; instead, SSE adds two new instructions that allow the register to be loaded and stored, &lt;code&gt;LDMXCSR&lt;/code&gt; &amp;amp; &lt;code&gt;STMXCSR&lt;/code&gt;. The figure below shows its layout, and the text after it explains its usage within the SSE environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbk83ku4ts2m7ojdiy6ik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbk83ku4ts2m7ojdiy6ik.png" alt="The MXCSR register"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bits 0-5 in MXCSR are flags that show that a certain type of floating-point exception occurred. They are sticky, meaning that the user (or the OS) has to reset them manually after an exception; otherwise they’ll stay set forever. Bits 7-12 are mask bits: they can be used to stop the CPU from raising an exception when the conditions for that specific exception are met, in which case the processor returns a default value instead (qNaN, sNaN, integer indefinite or one of the source operands; see [1] for more details).&lt;/p&gt;

&lt;p&gt;For more information on the meaning of the individual fields of the register, look at [1], Chapter 10.2.3.&lt;/p&gt;
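&lt;p&gt;As a rough illustration of the flag and mask bits described above, the following Python sketch decodes an MXCSR value into the exceptions that have been raised and those that are masked. The function names are hypothetical; the bit order (invalid, denormal, divide-by-zero, overflow, underflow, precision) follows the layout in [1].&lt;/p&gt;

```python
EXCEPTIONS = ["invalid", "denormal", "divide-by-zero",
              "overflow", "underflow", "precision"]


def bit(value, index):
    """Return bit `index` of a non-negative integer as 0 or 1."""
    return (value // 2**index) % 2


def decode_mxcsr(value):
    """Split an MXCSR value into raised flags and masked exceptions."""
    flags = [name for i, name in enumerate(EXCEPTIONS) if bit(value, i)]
    masked = [name for i, name in enumerate(EXCEPTIONS) if bit(value, 7 + i)]
    return flags, masked


# 0x1F80 is the documented power-on default: nothing raised, all masked.
print(decode_mxcsr(0x1F80))
# Same masks, with the divide-by-zero flag (bit 2) set.
print(decode_mxcsr(0x1F84))
```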

&lt;h2&gt;
  
  
  Instructions
&lt;/h2&gt;

&lt;p&gt;Now that we have covered the registers introduced in the SSE extension, let’s have a look at which new instructions Intel added and what they imply. To utilize SSE to its fullest extent, the very first step is to move data into the new XMM registers. SSE offers several instructions for this, of which the following (&lt;code&gt;movaps&lt;/code&gt; &amp;amp; &lt;code&gt;movups&lt;/code&gt;) are the most common:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
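&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create memory locations holding four single-precision floats each
.balign 16                              # movaps needs a 16-byte aligned operand
vector0: .float 3.14, 2.71, 1.23, 4.56
scalar0: .long 1234                     # 4 bytes, so vector1 is not 16-byte aligned
vector1: .float 3.62, 6.73, 8.41, 9.55

movaps vector0, %xmm0                   # aligned load
movups vector1, %xmm1                   # unaligned load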
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;movaps&lt;/code&gt; stands for &lt;strong&gt;MOV&lt;/strong&gt;e &lt;strong&gt;A&lt;/strong&gt;ligned &lt;strong&gt;P&lt;/strong&gt;acked &lt;strong&gt;S&lt;/strong&gt;ingle Precision Float, and &lt;code&gt;movups&lt;/code&gt; stands for the same, but &lt;strong&gt;U&lt;/strong&gt;naligned. The distinction matters: &lt;code&gt;movaps&lt;/code&gt; raises a general-protection fault if its memory operand is not aligned on a 16-byte boundary, while &lt;code&gt;movups&lt;/code&gt; accepts any address. Developers should generally aim for aligned access whenever possible for better overall performance.&lt;/p&gt;
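&lt;p&gt;The reason the snippet loads &lt;code&gt;vector1&lt;/code&gt; with the unaligned variant can be checked with a little arithmetic. The Python sketch below assumes that &lt;code&gt;vector0&lt;/code&gt; starts on a 16-byte boundary, that each vector occupies 16 bytes (four single-precision floats) and that &lt;code&gt;scalar0&lt;/code&gt; occupies 4 bytes; the helper &lt;code&gt;is_aligned&lt;/code&gt; is made up for the illustration.&lt;/p&gt;

```python
FLOAT_SIZE = 4   # bytes in one single-precision float
SCALAR_SIZE = 4  # bytes in the 32-bit scalar

# Offsets relative to vector0, assuming vector0 itself is 16-byte aligned.
offsets = {
    "vector0": 0,
    "scalar0": 4 * FLOAT_SIZE,
    "vector1": 4 * FLOAT_SIZE + SCALAR_SIZE,
}


def is_aligned(offset, boundary=16):
    """True if `offset` is a multiple of `boundary`."""
    return offset % boundary == 0


for name, offset in offsets.items():
    print(name, offset, is_aligned(offset))
```

&lt;p&gt;Under those assumptions &lt;code&gt;vector1&lt;/code&gt; sits 20 bytes after &lt;code&gt;vector0&lt;/code&gt;, which is not a multiple of 16, so only the unaligned load is safe for it.&lt;/p&gt;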

&lt;p&gt;Now that we have managed to move data into an XMM register, let’s do something with it. A trivial example, and one that we explored previously, is some simple vector manipulation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# assuming vector0 and vector1 from the previous snippet

movaps vector0, %xmm0
movups vector1, %xmm1

addps %xmm0, %xmm1 # ADD Packed Single-precision floats: xmm1 += xmm0
subps %xmm0, %xmm1 # SUBtract: xmm1 -= xmm0, undoing the previous operation
maxps %xmm0, %xmm1 # MAXimum: xmm1 = max(xmm1, xmm0) in each lane
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;maxps&lt;/code&gt; is a very handy instruction: it compares each of the four pairs of single-precision floats in its two operands and stores the larger of each pair in the destination register (&lt;code&gt;%xmm1&lt;/code&gt; here). The source operand can be either an XMM register or a 128-bit memory location. This instruction alone can save a large chunk of cycles by avoiding a loop and many &lt;code&gt;cmp&lt;/code&gt; and branch instructions.&lt;/p&gt;
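&lt;p&gt;For readers who want to follow the three packed operations lane by lane, here is a minimal Python model. It is a sketch, not an emulator: Python floats are double precision, NaN handling is ignored, and the function names simply mirror the mnemonics. As in AT&amp;amp;T syntax, the first argument is the source and the second is the destination.&lt;/p&gt;

```python
def addps(src, dst):
    """Lane-wise dst + src."""
    return [d + s for s, d in zip(src, dst)]


def subps(src, dst):
    """Lane-wise dst - src."""
    return [d - s for s, d in zip(src, dst)]


def maxps(src, dst):
    """Lane-wise maximum of dst and src."""
    return [max(d, s) for s, d in zip(src, dst)]


xmm0 = [3.14, 2.71, 1.23, 4.56]
xmm1 = [3.62, 6.73, 8.41, 9.55]

xmm1 = addps(xmm0, xmm1)
xmm1 = subps(xmm0, xmm1)  # back to the original values, up to rounding
xmm1 = maxps(xmm0, xmm1)  # each lane keeps the larger of the two inputs
print(xmm1)
```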

&lt;p&gt;Another interesting aspect of the SSE extension is its cacheability controls. The application programmer can now tell the CPU that some data is &lt;em&gt;“non-temporal”&lt;/em&gt;, that is, it won’t be needed in the near future, so the cache should not be polluted with it, like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;movntps %xmm0, vector0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reverse (i.e., if the programmer knows that a certain memory location will be needed in the near future) can also be signaled to the processor using the &lt;code&gt;PREFETCH&lt;/code&gt; family of instructions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instruction&lt;/th&gt;
&lt;th&gt;Pentium III&lt;/th&gt;
&lt;th&gt;Pentium 4/Xeon&lt;/th&gt;
&lt;th&gt;Temporal?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prefetcht0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;L2 or L1&lt;/td&gt;
&lt;td&gt;L2&lt;/td&gt;
&lt;td&gt;Temporal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prefetcht1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;L2&lt;/td&gt;
&lt;td&gt;L2&lt;/td&gt;
&lt;td&gt;Temporal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prefetcht2&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;L2&lt;/td&gt;
&lt;td&gt;L2&lt;/td&gt;
&lt;td&gt;Temporal&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;prefetchnta&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;L1&lt;/td&gt;
&lt;td&gt;L2&lt;/td&gt;
&lt;td&gt;Non-temporal&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The next extension we will be looking at is SSE2, which builds on the foundations of SSE and MMX to deliver better performance. Starting with the next installment, we will introduce benchmarks, too. In the meantime, have a look at a cache of examples in the &lt;a href="https://github.com/levex/x86-isa-extensions" rel="noopener noreferrer"&gt;GitHub Repo for the series!&lt;/a&gt; Until next time!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1]: &lt;a href="https://software.intel.com/en-us/articles/intel-sdm" rel="noopener noreferrer"&gt;Intel IA-32 Software Development Manual, Chapter 11.5.2: SIMD Floating-Point Exception Conditions&lt;/a&gt;&lt;/p&gt;

</description>
      <category>x86</category>
      <category>lowlevel</category>
      <category>c</category>
      <category>linux</category>
    </item>
    <item>
      <title>x86 ISA Extensions part I: MMX</title>
      <dc:creator>Levente Kurusa</dc:creator>
      <pubDate>Sun, 01 Apr 2018 17:26:04 +0000</pubDate>
      <link>https://dev.to/ilevex/x86-isa-extensions-part-i-mmx-1hjf</link>
      <guid>https://dev.to/ilevex/x86-isa-extensions-part-i-mmx-1hjf</guid>
      <description>

&lt;p&gt;Welcome to this series about instruction set extensions to the x86 architecture. x86 is a computer architecture that has evolved considerably over the years, and there have been many extensions to the original instruction set (including 64-bit "long" mode). Over the course of a few blog posts, we will explore these extensions and the reasoning behind their existence.&lt;/p&gt;

&lt;p&gt;So, the first extension I'd like to talk about is the MMX extension originally introduced with the Pentium P5 family of Intel processors in the late 1990s. Let's dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  Do you have MMX?
&lt;/h2&gt;

&lt;p&gt;As with every instruction set extension, there is a possibility that the CPU of your system does not support it. The chances are pretty slim with MMX, given that it has been around for a long while, but just to be sure, follow these instructions:&lt;/p&gt;

&lt;p&gt;On Linux:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /proc/cpuinfo | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-wq&lt;/span&gt; mmx &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"MMX available"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"MMX not available"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;On OS X/macOS:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;sysctl machdep.cpu.features | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-wq&lt;/span&gt; MMX &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"MMX available"&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"MMX not available"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Alternatively, you can use the &lt;code&gt;CPUID&lt;/code&gt; instruction to figure out whether your CPU supports MMX (as indicated by leaf 1, EDX bit 23):&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.text
.globl _is_mmx_available
_is_mmx_available:
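    pushq   %rbx            # cpuid clobbers %rbx, which is callee-saved

    movq    $1, %rax        # select CPUID leaf 1 (feature information)
    cpuid
    movq    %rdx, %rax      # the feature flags are returned in %edx
    shrq    $23, %rax       # move bit 23 (MMX) down to bit 0
    andq    $1, %rax        # keep only that bit as the return value

    popq    %rbx
    ret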
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Technical details
&lt;/h2&gt;

&lt;p&gt;Interestingly, MMX does not introduce new registers, but instead opts to introduce aliases for the bottom 64 bits of the 80-bit x87 FPU registers. These are called &lt;code&gt;%mm0, %mm1, ..., %mm7&lt;/code&gt;. Since these are only aliases and not real registers, it is immediately obvious that they cannot be used while an FPU operation is taking place. Since FPU registers are 80 bits long and MMX "registers" are 64 bits long, it's important to note where the MMX registers sit within the FPU registers: they form the 64-bit mantissa of the original FPU register, and the remaining 16 bits are all set to 1. This is useful, since it means the FPU sees the SIMD data in the registers as NaNs or infinities, and of course software can distinguish between the two types of data as well.&lt;/p&gt;
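&lt;p&gt;The aliasing can be pictured with plain integers. The Python sketch below builds the 80-bit register image that results from writing a 64-bit MMX value, following the description above (upper 16 bits all ones, lower 64 bits holding the data). The helper name &lt;code&gt;mmx_write&lt;/code&gt; and the sample value are invented for the illustration.&lt;/p&gt;

```python
def mmx_write(mm_value):
    """Model the 80-bit x87 register image after a 64-bit MMX write."""
    return 0xFFFF * 2**64 + mm_value % 2**64


image = mmx_write(0x1122334455667788)
exponent = (image // 2**64) % 2**15  # 15 exponent bits above the mantissa
mantissa = image % 2**64

print(hex(image))
print(exponent == 0x7FFF)  # an all-ones exponent encodes NaN or infinity
print(hex(mantissa))
```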

&lt;p&gt;But why did Intel choose to use aliases instead of adding new registers? They wanted to be trivially compatible with existing operating systems' context-switching code, which already knew how to save and restore the FPU registers and, by virtue of the aliasing, automatically saves and restores the MMX registers too.&lt;/p&gt;

&lt;p&gt;The main selling point of MMX is the ability to pack multiple values into the MMX registers and operate on each individual value separately, in one instruction; hence the SIMD (&lt;em&gt;Single Instruction Multiple Data&lt;/em&gt;) nature. It is possible to have eight one-byte values in a single MMX register:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%mm0 =
+------+------+------+------+------+------+------+------+
|   7  |   6  |  5   |   4  |   3  |   2  |  1   |   0  | Byte
+------+------+------+------+------+------+------+------+
| 0x11 | 0x22 | 0x33 | 0x44 | 0x55 | 0x66 | 0x77 | 0x88 | Values
+------+------+------+------+------+------+------+------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;or it is possible to have two four-byte values in the register:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%mm1 =
+------+------+------+------+------+------+------+------+
|   7  |   6  |  5   |   4  |   3  |   2  |  1   |   0  | Byte
+------+------+------+------+------+------+------+------+
|          0xcc77ff88       |          0x11223344       | Values
+------+------+------+------+------+------+------+------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Similarly, it is possible to have four two-byte values as well.&lt;/p&gt;
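&lt;p&gt;The three interpretations of the same 64 bits can be sketched in Python. The helper &lt;code&gt;lanes&lt;/code&gt; is hypothetical and the value is arbitrary; lanes are listed starting from the least significant one, matching x86's little-endian layout.&lt;/p&gt;

```python
def lanes(value, lane_bytes):
    """Split a 64-bit value into lanes, least significant lane first."""
    raw = value.to_bytes(8, "little")
    return [int.from_bytes(raw[i:i + lane_bytes], "little")
            for i in range(0, 8, lane_bytes)]


mm = 0x1122334455667788

print([hex(x) for x in lanes(mm, 1)])  # eight bytes
print([hex(x) for x in lanes(mm, 2)])  # four words
print([hex(x) for x in lanes(mm, 4)])  # two double-words
```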

&lt;h2&gt;
  
  
  New instructions
&lt;/h2&gt;

&lt;p&gt;An example usage of the new instructions can be found in the &lt;a href="https://github.com/levex/x86-isa-extensions/tree/master/mmx"&gt;GitHub Repo&lt;/a&gt; of this series. It simply adds two groups of four one-byte values together.&lt;/p&gt;

&lt;p&gt;Some simple instructions introduced by MMX can be seen in the following table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instruction&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;emms&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reset the MMX state&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;paddb&lt;/code&gt; / &lt;code&gt;paddw&lt;/code&gt; / &lt;code&gt;paddd&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Add two groups of bytes, words, or double-words&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;psubb&lt;/code&gt; / &lt;code&gt;psubw&lt;/code&gt; / &lt;code&gt;psubd&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Subtract two groups of bytes, words, or double-words&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;pcmpeqb&lt;/code&gt; / &lt;code&gt;pcmpeqw&lt;/code&gt; / &lt;code&gt;pcmpeqd&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Compare two groups of bytes, words, or double-words for equality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
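&lt;p&gt;To give a feel for what "operating on a group" means, here is a minimal Python model of two of the byte-sized instructions from the table. It is a sketch under simplifying assumptions: registers are plain lists of eight byte values, and the function names just mirror the mnemonics.&lt;/p&gt;

```python
def paddb(src, dst):
    """Lane-wise byte addition; each lane wraps around modulo 256."""
    return [(d + s) % 256 for s, d in zip(src, dst)]


def pcmpeqb(src, dst):
    """Lane-wise equality; equal lanes become 0xFF, others 0x00."""
    return [0xFF if d == s else 0x00 for s, d in zip(src, dst)]


a = [0x01, 0x02, 0xFF, 0x80, 0x10, 0x20, 0x30, 0x40]
b = [0x01, 0x03, 0x01, 0x80, 0x10, 0x00, 0x30, 0x01]

print(paddb(a, b))
print(pcmpeqb(a, b))
```

&lt;p&gt;Note that the third lane wraps around (0xFF + 0x01 becomes 0x00) without affecting its neighbours, which is the key difference from adding the two 64-bit values as ordinary integers.&lt;/p&gt;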

&lt;p&gt;A more complex example is the &lt;code&gt;punpck&lt;/code&gt; class of instructions, which interleave the elements of two groups of data, producing elements twice as wide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;punpckhbw&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;punpckhwd&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;punpckhdq&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;punpcklbw&lt;/code&gt; &lt;/li&gt;
&lt;li&gt;&lt;code&gt;punpcklwd&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;punpckldq&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interleave done by &lt;code&gt;punpckhbw&lt;/code&gt; is best described by a graphic. First, however, let us decompose the mnemonic into a more readable form:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;p&lt;/code&gt; =&amp;gt; packed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;unpck&lt;/code&gt; =&amp;gt; unpack&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;h&lt;/code&gt; =&amp;gt; high order&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;b&lt;/code&gt; =&amp;gt; from bytes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;w&lt;/code&gt; =&amp;gt; to words&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Putting it all together, the instruction interleaves the high-order bytes (the top half) of the source and destination registers, forming a group of words in the destination register. This can still be quite confusing, so here's a graphic that hopefully explains it a bit better:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
             Source Register                          Destination Register                     

+----+----+----+----+----+----+----+----+   +----+----+----+----+----+----+----+----+          
| Y7 | Y6 | Y5 | Y4 | Y3 | Y2 | Y1 | Y0 |   | X7 | X6 | X5 | X4 | X3 | X2 | X1 | X0 |          
+----+----+----+----+----+----+----+----+   +----+----+----+----+----+----+----+----+          
   |    |    |    |                            |    |    |    |                                
   |    |    |    |                            |    |    |    |                                
   |    |    |    |                            |    |    |    |                                
   |    |    |    |           +---------+------+--+-+----+  +-+                                
   |    |    |    |           |         |         |         |                                  
   |    |    |    |           |         |         |         |                                  
   |    |    |    |           v         v         v         v                                  
   |    |    |    |   +----+----+----+----+----+----+----+----+                                
   |    |    |    |   | Y7 | X7 | Y6 | X6 | Y5 | X5 | Y4 | X4 | Destination Register           
   |    |    |    |   +----+----+----+----+----+----+----+----+                                
   |    |    |    |      ^         ^         ^         ^                                       
   |    |    |    |      |         |         |         |                                       
   +----+----+----+------+---------+---------+---------+                                                                                                         
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
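&lt;p&gt;The same interleave can be written out as a short Python sketch. It is only a model of the diagram above: registers are lists of eight entries with index 0 as the least significant byte, and the labels X and Y are the ones used in the graphic.&lt;/p&gt;

```python
def punpckhbw(src, dst):
    """Interleave the high four bytes of dst and src.

    Both arguments are lists of eight bytes, index 0 being the least
    significant byte (X0 or Y0 in the diagram).
    """
    result = []
    for i in range(4, 8):
        result.append(dst[i])  # X byte lands in the low half of each word
        result.append(src[i])  # Y byte lands in the high half
    return result


src = ["Y0", "Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7"]
dst = ["X0", "X1", "X2", "X3", "X4", "X5", "X6", "X7"]

# Printed most significant byte first, to match the diagram.
print(list(reversed(punpckhbw(src, dst))))
```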



&lt;h2&gt;
  
  
  Successor
&lt;/h2&gt;

&lt;p&gt;AMD soon followed with its own extension to Intel's MMX, named "3DNow!", which didn't really see much success, but we will cover it in a future installment of this series.&lt;/p&gt;

&lt;p&gt;Other successors include an "Extended MMX" from Intel, and SSE (&lt;em&gt;Streaming SIMD Extensions&lt;/em&gt;). Extended MMX is of particular interest because it introduces several new, interesting instructions to MMX:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Instruction&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;movntq&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Move a quad-word (64-bit value) to memory and do not put it in the cache (bypass the cache)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pextrw&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Extract a (specified) word from a group&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pinsrw&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Insert a word into a group at a specified location&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pmovmskb&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Create an 8-bit integer from the most significant bits of eight one-byte values in an MMX register&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;pavgb&lt;/code&gt; / &lt;code&gt;pavgw&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Average two groups of (unsigned) bytes or words&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
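&lt;p&gt;Two of these instructions are easy to model in a few lines of Python. The sketch below assumes registers are lists of byte values with index 0 as the least significant byte; the function names mirror the mnemonics, and the &lt;code&gt;pavgb&lt;/code&gt; call uses only four lanes to keep the example short.&lt;/p&gt;

```python
def pmovmskb(mm_bytes):
    """Collect the top bit of each byte; byte i supplies bit i of the mask."""
    return sum((b // 128) * 2**i for i, b in enumerate(mm_bytes))


def pavgb(src, dst):
    """Unsigned byte average, rounding halves up."""
    return [(d + s + 1) // 2 for s, d in zip(src, dst)]


values = [0x88, 0x77, 0x66, 0x55, 0x44, 0x33, 0x22, 0x11]  # byte 0 first
print(bin(pmovmskb(values)))
print(pavgb([1, 2, 255, 0], [2, 2, 255, 1]))
```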

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I hope you enjoyed this part and will enjoy the rest of the series. I would be highly appreciative of any feedback you may have. If you have an example of a use-case for MMX, it would be nice to hear from you and of course, feel free to submit a pull request to the repo linked above.&lt;/p&gt;


</description>
      <category>x86</category>
      <category>lowlevel</category>
      <category>c</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
