Using SSE Registers for General Purpose. SIMD with General-purpose Registers

Madis Kalme, 2007-11-26, Revision 1.0

Nowadays the goal of a new CPU is not to give you the highest clock rate, nor even the highest raw performance (though that is still a big part of it), but to give you the best performance-per-watt figure. While specialized instructions are still being introduced, the trend is to perform well in every possible application, be it office work, scientific computing, education or gaming.

Technologies are becoming standardized, more unified and cross-compatible. Programming languages keep getting higher-level and thus more general. Assembly needs a kick toward generalization, too.

This article was composed from XMM as GPR, SIMD with GPR presentation, first presented at FASM Technical Conference II, Brno, Czech Republic, 25 August 2007.

Examples in the General World

Some examples: Intel and AMD both implement the MMX, SSE and SSE2 instruction sets, while Nvidia and ATI are going for unified shaders (as opposed to the previously separate pixel and vertex shaders). Nvidia also introduced CUDA a while ago and is offering its Tesla servers.

The first example is very helpful because programmers don't have to write code for multiple extensions and runtime capability detection. They can focus on one specific set of instructions and optimize for it.

The point of graphics cards moving to unified shaders is that the programmer no longer has to balance the code between different shader types; the only parameter to consider is the total shader budget, and it's up to the card to divide the tasks. No wasted resources!

CUDA's importance is probably the easiest to understand. It provides general-purpose computing on a graphics processing unit, GPGPU for short. This means that while the GPU would otherwise sit there doing nothing, the CPU can off-load some or all of its tasks to it.

Another interesting example is that Intel is planning a multi-core CPU where each core has a task, like a general-purpose CPU or maybe a signal processing unit. Perhaps some of the cores do the job usually meant for GPUs. The general part comes from the ability to share caches with each other and use any of the special-purpose hardware as they see fit. It's not a counter-example, because the CPU is in one physical package and you can do pretty much anything with it.

Using SSE Registers for General Purpose

First I would like to comment on the reason for leaving MMX out. While MMX is still supported for backward compatibility, the usage of these instructions is at a slight disadvantage. Now that 64-bit is around, the GPR size is equal to MMX's registers, and XMM is twice as wide, which gives more possibilities. Moreover, the current Core 2 and the future Barcelona architectures feature 128-bit wide execution of SSE, so virtually every calculation one can do in MMX can be converted to SSE without a loss in execution speed.

The SSE registers are named from xmm0 up to xmm7 (or up to xmm15 in 64-bit mode). They are designed specifically to operate on packed data in a parallel manner. They support floating-point (IEEE 754 single and double data-types), signed or unsigned integer (from bytes to qwords) and boolean arithmetic and logic operations. Scalar operation is also supported, which used to be the distinctive feature of GP programming; that is, it operates on only one unit of data, leaving the others untouched.

The boolean operations can be used on any data without change, though separate instructions are defined for better performance under different circumstances:

 ANDPx, ANDNPx, ORPx, XORPx   ; floating-point domain
 PAND, PANDN, POR, PXOR       ; integer domain (SSE2)

The x stands for the S (single) or D (double) floating-point data-type. Notice the difference between ANDPS, ANDPD and PAND! They nominally operate on different types, but the result remains the same. Here we can see that it's not really important what data is in the SSE registers, rather how we want to use them.

What is not such common knowledge is the horde of other instructions, like MOVD, MOVQ, MOVxxx, etc. to move data between memory, GPRs and XMMs. The xxx stands for all the possible combinations there are. More information can be found in the Intel Software Developer's Manual (volume 2A). All the arithmetic and shuffling instructions are convertible to the general-purpose world. In other words, we will be using SSE in a way it wasn't meant to be used. Of course it was meant to be used effectively, but not as a GPR replacement.

For example, with a line like MOVQ XMM0, RAX we are storing the value of RAX in the lower half of the XMM0 register. The problems are that it's 5 bytes long with a 0.5-clock throughput and a 2-clock latency, compared to MOV RBX, RAX which is 3 bytes long with a 0.33-clock throughput and only a 1-clock latency. We win in situations where otherwise PUSH/POP instructions or other memory-related moves would be needed. Moving back to RAX is faster, having a 0.33-clock throughput. So we can settle with the fact that it's not exactly a replacement for fast GPR moves, but it's a nice alternative when we run out of registers and memory is too slow.

Another situation is doing PSHUFB XMM0, [m128] after MOVQ XMM0, r64. With m128 = ..0001020304050607h we can imitate BSWAP behavior; this takes 10 bytes plus a 16-byte memory constant and 8 clocks in total, while BSWAP takes 3 bytes and 6 clocks. The winning scenario is when we need to BSWAP and keep a copy of the result, or when we need some other shuffling order.

Code Obfuscation

The changeover from general-purpose registers to SSE registers can be used for simple code obfuscation as well: we simply take advantage of the fact that SSE instructions are generally less known. Additionally, a scalar operation can be extended to a vector operation to confuse a third-party reverse engineer.

An example can be an add operation, such as:

foo DD 12345678h
 lea ebx, [foo]
 add eax, [ebx]

If the code which follows this instruction doesn't depend on the status flags, there are at least three dwords allocated on the stack (PADDD with a memory operand reads a full 16 bytes) and esp stays 16-byte aligned (an unaligned memory operand faults), a machine obfuscator can expand the add operation this way:

 push [ebx]
 movd xmm0, eax
 paddd xmm0, [esp]   ; pretend parallel operation
 pop eax             ; release stack
 movd eax, xmm0      ; get the result

Another simple obfuscation scheme is adding a random value right before the original operation and reverting it by subtraction afterwards. Let's take the sub [ebx], eax instruction now:

 push ecx
 mov ecx, 0deadbeefh   ; random value
 add [ebx], ecx
 sub [ebx], eax        ; original instruction
 sub [ebx], ecx        ; revert back
 pop ecx

Since various constants can be built using SSE operations in a hidden way, we can also obfuscate the making of the constant value:

 push ecx
 pcmpgtb xmm0, xmm0     ; ...00000000 (obfuscated pxor xmm0, xmm0)
 pcmpeqb xmm1, xmm1     ; ...FFFFFFFF
 punpcklbw xmm0, xmm1   ; ...FF00FF00
 movd ecx, xmm0         ; FF00FF00

Many other variations on these schemes are possible.

An old trick – the FPU

There were times when programmers used the FPU to speed up data movement. Successive FLD, FSTP pairs moved up to 8 bytes at once. The 64-bit extensions made this obsolete, because now there are general-purpose registers of the same width. They are not only faster (latency-wise 5 vs 6), but also add the ability to work in parallel and/or rearrange data (a run of FLDs followed by FSTPs reverses the data because of the FPU's FILO – stack-like – behavior). SSE moves double the amount (16 bytes) in exactly the same time.

I did some quick tests by throwing some data-copying measurements into my program and outputting the results. The CPU crunching the code was a Core 2 Duo. The why-questions about these results might be answered by Agner Fog's materials:

Mode    Type of code                   Clocks   Unrolled  Clocks  Notes
32-bit  rep movsd                      640      -         -       the fastest
        movsd                          6165     -         -
        mov [esi] > eax > [edi]        2069     ×4        1120
        fld [esi], fstp [edi]          2069     ×8        1549
        movd [esi] > xmm0 > [edi]      2070     ×2        1379
                                                ×4        1316
                                                ×8        1669    unrolled too much
64-bit  rep movsq                      628      -         -
        movsq                          3106     -         -
        mov [rsi] > rax > [rdi]        1045     ×2        563
                                                ×4        602     unrolled too much
        fld [rsi], fstp [rdi]          1045     ×2        562
                                                ×4        628     unrolled too much
                                                ×8        683     unrolled too much
        movq [rsi] > xmm0 > [rdi]      1046     ×2        555
                                                ×4        602     unrolled too much
                                                ×8        625     unrolled too much
                                                ×16       738     unrolled too much
        movaps [rsi] > xmm0 > [rdi]    533      ×2        298
                                                ×4        258     the fastest
                                                ×8        326

SIMD with GPRs

To make the message even clearer, I will point out that generalization works both ways.


It means that if, for example, you would like to broadcast an important byte, word or dword across a full register, you just have to IMUL r/m, const where const is a repetitive pattern of 01h, 0001h or 00000001h. The result is a repetitive pattern of the source.

Example: Byte, word and dword broadcast:

 mov  eax, 34h         ;34h is important byte
 imul eax, 01010101h   ;eax = 34343434h
 mov  eax, 34h
 imul eax, 00010001h   ;eax = 00340034h
 mov  rax, 34h
 mov  rcx, 100000001h  ;pattern too wide for an imm32 operand
 imul rax, rcx         ;rax = 0000003400000034h

Addition and Subtraction

SIMD addition or subtraction is also possible, but with double-sized lanes. It means that for every word you want to calculate, you need at least a dword-sized lane, because an addition or subtraction might overflow into the neighbouring lane, and you need to catch that. Actually, within one calculation you only need one extra bit to absorb the carry, but it's easiest to implement on a full machine-type border like a word, dword or qword. For example: byte addition (signed or unsigned) is only possible with at least word-wide lanes in a register.

Example: SIMD byte addition – 43h+FFh and FEh+5h done in parallel:

 mov eax, 004300FEh   ;word data
 add eax, 00FF0005h   ;eax = 01420103

The result is 0142_0103, but because we're dealing with bytes, we only consider 42h and 03h as the results.


Multiplies are a bit more limited and only possible with the same constant for all lanes. Take for example two byte-sized integers packed in a dword and another byte to multiply them with. One would look like 002300FFh and the other something like 35h. The multiplied result will look like this: 073F_34CBh. Multiplies always produce double-width results, so again, as with the simpler arithmetic, a byte calculation results in words. Interpreting the results is up to the programmer, but unsigned multiplication is the easiest to understand.

Example: SIMD word multiply – 14h×4h and FFh×4h done in parallel:

 mov  eax, 001400FFh   ;word data
 imul eax, 00000004h   ;eax = 0050_03FC

64-bit extensions add more possibilities with wider registers.


Generalization between SSE and GPRs is possible, but needs some pondering. With some thinking, such generalization can be even more efficient than the conventional method. We can use all the potential the CPU has, and thus make our algorithms faster or impress our colleagues. While the whole world is moving, why should we stand still?

Referenced materials


Shaping the future with Intel® Tera-Scale computing research

Barcelona's 128-bit wide SSE execution

Intel Software Developer's Manual, Volume 2A, Instruction Set Reference A-M

Agner Fog's optimization manuals, every assembly programmer's bible



The author doesn't wish to publish his e-mail here.


2007-11-26  1.0  First public version  Madis Kalme

(dates format correspond to ISO 8601)