DEV Community: Levente Kurusa

x86 ISA Extensions part II: SSE

Levente Kurusa — Sun, 15 Apr 2018 15:38:00 +0000

Welcome back to this series exploring the many extensions the x86 architecture has seen over the past decades. In this installment, we will be looking at the successor to MMX: Streaming SIMD Extensions, or SSE for short. Most of these instructions are SIMD (as the name implies), which stands for Single Instruction Multiple Data. In brief, SIMD instructions are similar to the ones we’ve covered in the MMX article: a single instruction operates on several data elements at once.

SSE was introduced in 1999 with Intel’s Pentium III, soon after Intel saw AMD’s “3DNow!” extension (we will cover this extension in a future installment, but right now I lack access to an AMD machine that I could use 🙂). A question arises naturally: SSE wasn’t the first SIMD set that Intel introduced to the x86 family of processors, so why did Intel create a new extension set? Unfortunately, MMX had two major problems at the time. First, the registers it “introduced” were aliases of previously existing registers (amusingly, this was touted as an advantage for a while because of the easier context switching), which meant that floating-point and MMX operations couldn’t coexist. Second, MMX only worked on integers; it had no support for floating-point values, which were an increasingly important part of 3D computer graphics. SSE adds dozens of new instructions that operate on an independent register set, plus a few integer instructions that continue to operate on the old MMX registers.

(A slight note before we start: in this article “SSE” refers to the very first SSE extension introduced by Intel. In future installments of this series, we will explore SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2, but here we focus on “SSE1”.)

Do you have SSE?

As with all instruction set extensions, there is a chance that your CPU does not have it. The chances are once again pretty slim with SSE, given its age, but it’s always interesting to see how you can make sure that your CPU supports SSE.

On Linux:

$ cat /proc/cpuinfo | grep -wq sse && echo "SSE available" || echo "SSE not available"

On OS X/macOS:

$ sysctl machdep.cpu.features | grep -wq SSE && echo "SSE available" || echo "SSE not available"

Alternatively, CPUID offers a way to gather this information on bare-metal or in an OS-agnostic way. SSE is indicated by CPUID leaf 1, EDX bit 25:

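.text
.globl _is_sse_available
_is_sse_available:
    pushq   %rbx            # cpuid clobbers %rbx, which is callee-saved

    movq    $1, %rax        # request CPUID leaf 1 (feature information)
    cpuid
    movq    %rdx, %rax      # the feature flags come back in %edx
    shrq    $25, %rax       # move bit 25 (SSE) down to bit 0
    andq    $1, %rax        # return 1 if SSE is supported, 0 otherwise

    popq    %rbx
    ret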
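If you would rather stay in C, the same check can be sketched with the `<cpuid.h>` helper header that ships with GCC and Clang (this is an assumption about your toolchain; other compilers expose CPUID differently). The function names below are made up for this example:

```c
#include <cpuid.h>   /* GCC/Clang helper header (assumed toolchain) */

/* Hypothetical helper: returns 1 if CPUID leaf 1 sets the given EDX bit. */
int cpuid_leaf1_edx_bit(unsigned int bit) {
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested leaf is not supported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    return (edx >> bit) & 1u;
}

/* SSE is reported in EDX bit 25, exactly as in the assembly version. */
int is_sse_available(void) {
    return cpuid_leaf1_edx_bit(25);
}
```

The helper takes the bit number as a parameter, so the same function could be reused for other leaf 1 feature bits in EDX.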

Once you are satisfied that your CPU allows for SSE instructions, it’s time to dive into the specifics of SSE!

Registers

Since SSE introduces actual, new registers (in contrast with its predecessor), I think it’s useful to have a quick glance at them. SSE added eight 128-bit registers named %xmm0, %xmm1, ..., %xmm7. (Amusingly, xmm is mmx spelled backwards. I assume this is meant as a pun, but I couldn’t find a source confirming it.) In stark contrast with MMX, SSE does not allow for multiple data types: each XMM register holds four 32-bit single-precision floating-point values, while an MMX register could hold integers of different widths.

%xmm0, %xmm1, ..., %xmm7:
*- - - - - - - - -*- - - - - - - - -*- - - - - - - - -*- - - - - - - - -*
| 32-bit SP float | 32-bit SP float | 32-bit SP float | 32-bit SP float |
*- - - - - - - - -*- - - - - - - - -*- - - - - - - - -*- - - - - - - - -*
|                            128-bit value                              |
*- - - - - - - - -*- - - - - - - - -*- - - - - - - - -*- - - - - - - - -*

In this figure, each row represents a data type that an XMM register can hold with SSE. I’ve put the “128-bit value” row in the figure because, if you only load data into the register and do not issue any floating-point operation on it, it can hold any unstructured data. However, the floating-point instructions only interpret the register as four single-precision floating-point values, and running them on unstructured data can cause floating-point exceptions.

To control the state of some operations, an additional control and status register is added, dubbed MXCSR. This register cannot be accessed using the mov family of instructions; instead, SSE adds two new instructions that load and store the register: LDMXCSR and STMXCSR. The most important parts of its layout, and their usage within the SSE environment, are described below.

Bits 0-5 in MXCSR are flags that show that a certain type of floating-point exception occurred. They are also sticky, meaning that the user (or the OS) has to reset them manually after an exception, otherwise they’ll stay set forever. Bits 7-12 are mask bits: each one stops the CPU from raising the corresponding exception when its condition is met, in which case the processor instead returns a default value (qNaN, sNaN, integer indefinite or one of the source operands; see [1] for more details).

For more information on the meaning of each field in this register, see Chapter 10.2.3 of the manual cited in [1].
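To make the sticky flags a bit more tangible, here is a minimal C sketch. It assumes GCC or Clang on x86, where `_mm_getcsr` and `_mm_setcsr` from `<xmmintrin.h>` wrap STMXCSR and LDMXCSR. The helper name and the macros are made up for this example; the bit positions follow the description above, and bit 2 is the divide-by-zero flag.

```c
#include <xmmintrin.h>   /* _mm_getcsr / _mm_setcsr wrap stmxcsr / ldmxcsr */

#define MXCSR_FLAGS    0x003Fu   /* bits 0-5: sticky exception flags */
#define MXCSR_MASKS    0x1F80u   /* bits 7-12: exception mask bits */
#define MXCSR_FLAG_ZE  0x0004u   /* bit 2: divide-by-zero flag */

/* Hypothetical helper: divides a by b with a scalar SSE division and
   returns the exception flags that the division left behind in MXCSR. */
unsigned int divide_and_get_flags(float a, float b) {
    unsigned int saved = _mm_getcsr();

    /* Mask every exception and clear the sticky flags of earlier work. */
    _mm_setcsr((saved | MXCSR_MASKS) & ~MXCSR_FLAGS);

    /* volatile keeps the compiler from folding the division away. */
    volatile float x = a, y = b;
    volatile float q = _mm_cvtss_f32(_mm_div_ss(_mm_set_ss(x), _mm_set_ss(y)));
    (void)q;

    unsigned int flags = _mm_getcsr() & MXCSR_FLAGS;
    _mm_setcsr(saved);   /* restore the caller's MXCSR */
    return flags;
}
```

Because the exception is masked, dividing by zero does not trap here; the processor returns a default result and only records the event in the flag bit.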

Instructions

Now that we have covered the registers introduced in the SSE extension, let’s have a look at what new instructions Intel has added and their implications. To utilize SSE to its fullest extent, the very first step is to move data into the new XMM registers. SSE offers a couple of instructions for this, of which the following two (movaps and movups) are the most common:

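# Create a memory location with four single-precision floats
.data
.p2align 4                               # 16-byte alignment, required by movaps
vector0: .float 3.14, 2.71, 1.23, 4.56
scalar0: .long 1234                      # 4 bytes, so vector1 ends up unaligned
vector1: .float 3.62, 6.73, 8.41, 9.55

.text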

movaps vector0, %xmm0
movups vector1, %xmm1

movaps stands for MOVe Aligned Packed Single-precision float, and movups stands for the same, but Unaligned. The distinction is important: movaps requires its memory operand to sit on a 16-byte boundary and raises a general-protection fault otherwise, while movups accepts any address. Developers should generally aim for aligned access whenever possible for better overall performance.
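The aligned/unaligned split also shows up in C through the intrinsics in `<xmmintrin.h>`: `_mm_load_ps` corresponds to movaps and `_mm_loadu_ps` to movups. The sketch below assumes GCC or Clang on x86; the helper names are made up, and checking the alignment at run time is only done here to illustrate the difference.

```c
#include <stdint.h>
#include <xmmintrin.h>

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* Loads four floats: aligned load (movaps) when the address allows it,
   unaligned load (movups) otherwise. */
__m128 load4(const float *p) {
    return is_aligned16(p) ? _mm_load_ps(p) : _mm_loadu_ps(p);
}

/* Copies four floats from src to dst through an XMM register. */
void copy4(const float *src, float *dst) {
    _mm_storeu_ps(dst, load4(src));
}
```

Calling `_mm_load_ps` directly on an address that is not 16-byte aligned would fault, which is why the check comes first in this sketch.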

Now that we have managed to move data into an XMM register, let’s do something with it. A trivial example, and one that we explored previously, is some simple vector manipulation:

# assuming vector0 and vector1 from the previous snippet

movaps vector0, %xmm0
movups vector1, %xmm1

addps %xmm0, %xmm1 # ADD Packed Single precision float: %xmm1 += %xmm0
subps %xmm0, %xmm1 # undo previous operation: %xmm1 -= %xmm0
maxps %xmm0, %xmm1 # each lane of %xmm1 = max(%xmm1, %xmm0)

maxps is a very handy instruction: it compares each of the four pairs of single-precision floats in its two operands and writes the larger of each pair into the destination operand. The destination must be an XMM register such as %xmm1, while the source can be either an XMM register or a 128-bit memory location. This instruction alone can save a large chunk of cycles by avoiding a loop and many compare and branch instructions.
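For comparison, here is how the same packed operations might look from C, assuming GCC or Clang on x86. `_mm_add_ps` and `_mm_max_ps` map to addps and maxps; the wrapper names `add4` and `max4` are made up for this sketch.

```c
#include <xmmintrin.h>

/* out[i] = a[i] + b[i] for four floats (addps). Pointers may be unaligned. */
void add4(const float *a, const float *b, float *out) {
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

/* out[i] = the larger of a[i] and b[i] for four floats (maxps).
   No loop and no branches are needed. */
void max4(const float *a, const float *b, float *out) {
    _mm_storeu_ps(out, _mm_max_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

Each function handles all four lanes with a single arithmetic instruction, which is the saving described above.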

Another interesting aspect of the SSE extension is its cacheability controls. The application programmer can now tell the CPU that some data is “non-temporal”, that is, it won’t be needed in the near future, so storing it should not pollute the cache, like so:

movntps %xmm0, vector0
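In C, the non-temporal store is available as `_mm_stream_ps`. The following is a minimal sketch, assuming GCC or Clang on x86; the function name is made up, and it assumes that the destination is 16-byte aligned and that the element count is a multiple of four.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Fills dst with value using non-temporal stores (movntps).
   Assumptions: dst is 16-byte aligned and count is a multiple of 4. */
void fill_nontemporal(float *dst, size_t count, float value) {
    __m128 v = _mm_set1_ps(value);   /* the same value in all four lanes */

    for (size_t i = 0; i < count; i += 4)
        _mm_stream_ps(dst + i, v);

    /* Non-temporal stores are weakly ordered; sfence (also added by SSE)
       makes them visible before any store that follows. */
    _mm_sfence();
}
```

This kind of store fits large buffers that are written once and not read back soon, such as an output frame.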

The reverse (i.e., if the programmer knows that a certain memory location will be needed in the near future) can also be signaled to the processor using the PREFETCH family of instructions:

Instruction	Cache level filled (Pentium III)	Cache level filled (Pentium 4/Xeon)	Temporal?
`prefetcht0`	L2 or L1	L2	Temporal
`prefetcht1`	L2	L2	Temporal
`prefetcht2`	L2	L2	Temporal
`prefetchnta`	L1	L2	Non-temporal
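From C, these hints are reachable through `_mm_prefetch`. Below is a minimal sketch, assuming GCC or Clang on x86. The function name, the prefetch distance and the assumed 64-byte cache line are illustrative guesses that would need measuring on real hardware.

```c
#include <stddef.h>
#include <xmmintrin.h>

#define PREFETCH_DISTANCE 64   /* floats to look ahead; a guess, not a tuned value */

/* Sums an array while hinting that data further ahead will be needed soon. */
float sum_with_prefetch(const float *data, size_t count) {
    float total = 0.0f;

    for (size_t i = 0; i < count; i++) {
        /* Once per 16 floats (64 bytes, an assumed cache line size),
           ask for the line we will reach a few iterations from now. */
        if (i % 16 == 0 && i + PREFETCH_DISTANCE < count)
            _mm_prefetch((const char *)(data + i + PREFETCH_DISTANCE), _MM_HINT_T0);

        total += data[i];
    }
    return total;
}
```

A prefetch is only a hint, so the result is the same with or without it; only the timing may change. Swapping `_MM_HINT_T0` for `_MM_HINT_NTA` would request the non-temporal variant instead.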

Conclusion

The next extension we will be looking at is SSE2, which builds on the foundations of SSE and MMX to deliver better performance. Starting with the next installment, we will introduce benchmarks, too. In the meantime, have a look at a cache of examples in the GitHub Repo for the series! Until next time!

References

[1]: Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 1, Chapter 11.5.2: SIMD Floating-Point Exception Conditions

x86 ISA Extensions part I: MMX

Levente Kurusa — Sun, 01 Apr 2018 17:26:04 +0000

Welcome to this series about instruction set extensions to the x86 architecture. x86 is a computer architecture that has evolved a lot over the years, and there have been many extensions to the original instruction set (including 64-bit "long" mode). Over the course of a few blog posts, we will explore these extensions and the reasoning behind their existence.

So, the first extension I'd like to talk about is the MMX extension originally introduced with the Pentium P5 family of Intel processors in the late 1990s. Let's dive in!

Do you have MMX?

As with every instruction set extension, there is a possibility that the CPU of your system does not support it. The chances are pretty slim with MMX given that it has been around for a long while, but just to be sure, follow these instructions:

On Linux:

$ cat /proc/cpuinfo | grep -wq mmx && echo "MMX available" || echo "MMX not available"

On OS X/macOS:

$ sysctl machdep.cpu.features | grep -wq MMX && echo "MMX available" || echo "MMX not available"

Alternatively, you can use the CPUID instruction to figure out whether your CPU supports MMX (as indicated by leaf 1, EDX bit 23):

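.text
.globl _is_mmx_available
_is_mmx_available:
    pushq   %rbx            # cpuid clobbers %rbx, which is callee-saved

    movq    $1, %rax        # request CPUID leaf 1 (feature information)
    cpuid
    movq    %rdx, %rax      # the feature flags come back in %edx
    shrq    $23, %rax       # move bit 23 (MMX) down to bit 0
    andq    $1, %rax        # return 1 if MMX is supported, 0 otherwise

    popq    %rbx
    ret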

Technical details

Interestingly, MMX does not introduce new registers, but instead opts to introduce aliases for the bottom 64 bits of the 80-bit x87 FPU registers. These are called %mm0, %mm1, ..., %mm7. Since these are only aliases and not real registers, it is immediately obvious that they cannot be used while FPU code is using the same registers. Since FPU registers are 80 bits long and MMX "registers" are 64 bits long, it's important to note where in the FPU registers the MMX registers live: they form the 64-bit mantissa of the original FPU register, and the remaining 16 bits are all set to 1. This is useful, since it means the FPU sees the SIMD data in the registers as NaNs or infinities, and of course software can distinguish between the two types of data as well.

But why did Intel choose to use aliases instead of adding new registers? They wanted to be trivially compatible with existing operating systems' context switching code, which already knew how to save and restore the FPU registers and now, by virtue of aliasing, also saves and restores the MMX registers.

The main selling point of MMX is the ability to pack multiple values into the MMX registers and operate on each individual value separately, in one instruction, hence the SIMD (Single Instruction Multiple Data) nature. It is possible to have eight one-byte values in a single MMX register:

%mm0 = 
+------+------+------+------+------+------+------+------+
|   7  |   6  |  5   |   4  |   3  |   2  |  1   |   0  | Byte
+------+------+------+------+------+------+------+------+
| 0x11 | 0x22 | 0x33 | 0x44 | 0x55 | 0x66 | 0x77 | 0x88 | Values
+------+------+------+------+------+------+------+------+

or it is possible to have two four-byte values in the register:

%mm1 = 
+------+------+------+------+------+------+------+------+
|   7  |   6  |  5   |   4  |   3  |   2  |  1   |   0  | Byte
+------+------+------+------+------+------+------+------+
|          0xcc77ff88       |          0x11223344       | Values
+------+------+------+------+------+------+------+------+

Similarly, it is possible to have four two-byte values as well.

New instructions

An example usage of the new instructions can be found in the GitHub Repo of this series. It simply adds two groups of four one-byte values together.

Some simple instructions introduced by MMX can be seen in the following table:

Instruction	Description
`emms`	Empty the MMX state so that the aliased x87 registers can be used by FPU code again
`paddb` / `paddw` / `paddd`	Add two groups of bytes, words, or double-words
`psubb` / `psubw` / `psubd`	Subtract two groups of bytes, words, or double-words
`pcmpeqb` / `pcmpeqw` / `pcmpeqd`	Compare two groups of bytes, words, or double-words for equality
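As a small illustration of the packed add, here is a C sketch using the MMX intrinsics from `<mmintrin.h>`. It assumes GCC or Clang on x86, where `_mm_add_pi8` corresponds to paddb and `_mm_empty` to emms. The function name is made up for this example.

```c
#include <mmintrin.h>
#include <stdint.h>
#include <string.h>

/* Adds eight pairs of bytes at once (paddb). Each byte wraps around
   on overflow, independently of its neighbours. */
void add_bytes8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8]) {
    __m64 va, vb, sum;

    memcpy(&va, a, 8);             /* pack eight bytes into one 64-bit value */
    memcpy(&vb, b, 8);
    sum = _mm_add_pi8(va, vb);
    memcpy(out, &sum, 8);

    _mm_empty();                   /* emms: hand the registers back to the FPU */
}
```

Note how a byte that overflows (255 + 1) wraps to 0 without carrying into the next byte, which is what separates a packed add from an ordinary 64-bit add.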

A more complex example is the punpck class of instructions, which interleave elements from two groups of data so that each pair of elements forms one element of twice the size:

punpckhbw
punpckhwd
punpckhdq
punpcklbw
punpcklwd
punpckldq

The interleave done by punpckhbw is best described by a graphic. First, however, let us decompose the instruction name into a more readable form:

p => packed
unpck => unpack
h => high order
b => from bytes
w => to words

Putting it all together, this implies that the instruction interleaves the high-order halves (top four bytes) of the source and destination registers, byte by byte, so that the destination ends up holding four words. This can still be quite confusing, so here's a graphic that hopefully explains it a bit better:


             Source Register                          Destination Register                     

+----+----+----+----+----+----+----+----+   +----+----+----+----+----+----+----+----+          
| Y7 | Y6 | Y5 | Y4 | Y3 | Y2 | Y1 | Y0 |   | X7 | X6 | X5 | X4 | X3 | X2 | X1 | X0 |          
+----+----+----+----+----+----+----+----+   +----+----+----+----+----+----+----+----+          
   |    |    |    |                            |    |    |    |                                
   |    |    |    |                            |    |    |    |                                
   |    |    |    |                            |    |    |    |                                
   |    |    |    |           +---------+------+--+-+----+  +-+                                
   |    |    |    |           |         |         |         |                                  
   |    |    |    |           |         |         |         |                                  
   |    |    |    |           v         v         v         v                                  
   |    |    |    |   +----+----+----+----+----+----+----+----+                                
   |    |    |    |   | Y7 | X7 | Y6 | X6 | Y5 | X5 | Y4 | X4 | Destination Register           
   |    |    |    |   +----+----+----+----+----+----+----+----+                                
   |    |    |    |      ^         ^         ^         ^                                       
   |    |    |    |      |         |         |         |                                       
   +----+----+----+------+---------+---------+---------+
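The graphic can also be checked with a few lines of C. This is a minimal sketch, assuming GCC or Clang on x86, where `_mm_unpackhi_pi8` corresponds to punpckhbw; the function and parameter names are made up. The first argument plays the role of the destination register (X) and the second the source register (Y).

```c
#include <mmintrin.h>
#include <stdint.h>
#include <string.h>

/* Interleaves the top four bytes of x (destination) and y (source).
   Result, from byte 0 upwards: X4 Y4 X5 Y5 X6 Y6 X7 Y7. */
void interleave_high_bytes(const uint8_t x[8], const uint8_t y[8], uint8_t out[8]) {
    __m64 vx, vy, result;

    memcpy(&vx, x, 8);
    memcpy(&vy, y, 8);
    result = _mm_unpackhi_pi8(vx, vy);
    memcpy(out, &result, 8);

    _mm_empty();   /* emms */
}
```

Keep in mind that the graphic prints byte 7 on the left, while `out[0]` is byte 0, so the array reads in the opposite direction.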

Successor

AMD soon followed with its own extension to Intel's MMX, named "3DNow!", which didn't really see much success, but we will cover it in a later installment of this series.

Other successors include an "Extended MMX" from Intel, and SSE (Streaming SIMD Extensions). Extended MMX is of particular interest because it introduces several new, interesting instructions to MMX:

Instruction	Description
`movntq`	Move a quad-word (64-bit value) to memory and do not put it in the cache (bypass the cache)
`pextrw`	Extract a (specified) word from a group
`pinsrw`	Insert a word into a group at a specified location
`pmovmskb`	Create an 8-bit integer from the most significant bits of eight one-byte values in an MMX register
`pavgb` / `pavgw`	Averages the (unsigned) bytes or words
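Two of these are easy to try from C. The sketch below assumes GCC or Clang on x86, where `<xmmintrin.h>` provides `_mm_movemask_pi8` (pmovmskb) and `_mm_avg_pu8` (pavgb) for MMX registers. The function names are made up for this example.

```c
#include <stdint.h>
#include <string.h>
#include <xmmintrin.h>

/* Collects the most significant bit of each of the eight bytes into an
   8-bit mask (pmovmskb). Bit i of the result comes from bytes[i]. */
int sign_bit_mask(const uint8_t bytes[8]) {
    __m64 v;
    memcpy(&v, bytes, 8);
    int mask = _mm_movemask_pi8(v);
    _mm_empty();   /* emms */
    return mask;
}

/* Averages eight pairs of unsigned bytes (pavgb), rounding up:
   out[i] = (a[i] + b[i] + 1) >> 1. */
void average_bytes8(const uint8_t a[8], const uint8_t b[8], uint8_t out[8]) {
    __m64 va, vb, avg;
    memcpy(&va, a, 8);
    memcpy(&vb, b, 8);
    avg = _mm_avg_pu8(va, vb);
    memcpy(out, &avg, 8);
    _mm_empty();   /* emms */
}
```

The average is computed with an extra bit of precision internally, so 255 and 255 average to 255 instead of overflowing.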

Conclusion

I hope you enjoyed this part and will enjoy the rest of the series. I would be highly appreciative of any feedback you may have. If you have an example of a use-case for MMX, it would be nice to hear from you and of course, feel free to submit a pull request to the repo linked above.