Madis Kalme, 2007-11-26 | Revision: 1.0
Nowadays the goal of a new CPU is not to give you the highest clock rate, nor even the highest raw performance (though that is still a big part of it), but to give you the best performance-per-watt figure. While specialized instructions are still being introduced, the overall course is to perform well in every possible application, be it office work, scientific computing, education or gaming.
Technologies are becoming standardized, more unified and cross-compatible. Programming languages are all moving to higher levels and thus becoming more general. Assembly needs a kick towards generalization too.
This article was composed from the XMM as GPR, SIMD with GPR presentation, first given at FASM Technical Conference II, Brno, Czech Republic, 25 August 2007.
Some examples: Intel and AMD both implement the MMX, SSE and SSE2 instruction sets, while Nvidia and ATI are going for unified shaders (as opposed to the previously popular separate pixel and vertex shaders). Nvidia also introduced CUDA a while ago and is offering its Tesla servers.
The first example is very helpful because programmers no longer have to write code for multiple extensions with runtime capability detection. They can focus on these specific instructions and optimize for them.
The point of graphics cards moving to unified shaders is that the programmer doesn't have to balance the code between different shader types; the only parameter to look at is the total shader budget, and it's up to the card to divide the tasks. No wasted resources!
CUDA's importance is probably the easiest to understand. It provides general-purpose computing on a graphics processing unit, GPGPU for short. This means that instead of the GPU sitting there doing nothing, the CPU can off-load some or all of its tasks to it.
Another interesting example is that Intel is planning a multi-core CPU where each core has a dedicated task, like a general-purpose CPU or perhaps a signal processing unit. Some of the cores might do the job usually meant for GPUs. The general part comes from the cores' ability to share caches with each other and use any of the special-purpose hardware as they see fit. It's not a counter-example, because the CPU is one physical package and you can do pretty much anything with it.
First I would like to comment on the reason for leaving MMX out. While MMX is still supported for backward compatibility, using these instructions puts you at a slight disadvantage. With 64-bit around, the GPRs are as wide as MMX's registers, and XMM registers are twice as wide again, which gives more possibilities. Moreover, the current Core 2 and the upcoming Barcelona architectures feature 128-bit wide SSE execution, so virtually every calculation one can do in MMX can be converted to SSE without a loss in execution speed.
The SSE registers are named xmm0 up to xmm7 (or xmm15 on the 64-bit architecture). They are designed specifically to operate on packed data in a parallel manner. They support floating-point (IEEE 754 single and double data types), signed or unsigned integer (from bytes to qwords) and boolean arithmetic and logic operations. Scalar operation, which used to be distinctive to GP programming, is also supported: it operates on only one unit of data, leaving the others untouched.
The boolean operations can be used on any data without change, though there are distinct instructions for better performance under different circumstances:

- ANDPx (A & B), integer equivalent is PAND
- ANDNPx (~A & B), PANDN for integer
- ORPx (A | B), POR for integer
- XORPx (A ^ B), PXOR for integer
- PSLLx (A << i), same for all data types
- PSRLx (A >> i) and PSRAx (A >>> i)

The x stands for S (single) or D (double) floating-point data type. Notice the difference between ANDPS, ANDPD and PAND! They operate on different types, but the result remains the same. Here we can see that it's not really important what data is in the SSE registers, rather how we want to use them.
What is not such common knowledge is the horde of other instructions, like MOVD, MOVQ, MOVxxx, etc., for moving data between memory, GPRs and XMMs. The xxx stands for all the possible combinations there are; more information is in the Intel Software Developer's Manual (Volume 2A). All the arithmetic and shuffling instructions are convertible to the general-purpose world. In other words, we will be using SSE in a way it wasn't meant to be used. Of course it was meant to be used effectively, but not as a GPR replacement.
For example, with a line like MOVQ XMM0, RAX we are storing the value of RAX in the lower half of the XMM0 register. The problems are that it's 5 bytes long, with a 0.5-clock throughput and a 2-clock latency, compared to MOV RBX, RAX, which is 3 bytes long, with a 0.33-clock throughput and only a 1-clock latency. We will win in situations where otherwise PUSH/POP instructions or other memory-related moves would be needed. Moving back to RAX is faster, having a 0.33-clock throughput. So we can settle with the fact that it's not exactly a replacement for fast GPR moves, but it's a nice alternative when we run out of registers and memory is too slow.
Another situation is doing PSHUFB XMM0, [m128] right after MOVQ XMM0, r64. We can imitate BSWAP behaviour with m128 = ..0001020304050607h, which costs 10 bytes plus 16 bytes of memory and 8 clocks in total, while BSWAP takes 3 bytes and 6 clocks. The winning scenario is when we need to BSWAP and copy the result, or when we need some other shuffling order.
The changeover from general-purpose registers to SSE registers can also be used for simple code obfuscation. We can simply take advantage of the fact that SSE instructions are generally less known. Additionally, a scalar operation can be extended to a vector operation to confuse a third-party reverse engineer.
An example can be an add operation, such as:

    foo DD 12345678h
    ...
    lea ebx, [foo]
    add eax, [ebx]
If the code following this instruction doesn't depend on the status flags and there are at least three dwords allocated on the stack, a machine obfuscator can expand the add operation this way:
    push [ebx]
    movd xmm0, eax
    paddd xmm0, [esp]   ; pretend parallel operation
    pop eax             ; release stack
    movd eax, xmm0      ; get the result
Another simple obfuscation scheme is adding a random value right before the original operation and reverting it by subtraction afterwards. Let's now take the sub [ebx], eax instruction:
    push ecx
    mov ecx, 0deadbeefh ; random value
    add [ebx], ecx
    sub [ebx], eax      ; original instruction
    sub [ebx], ecx      ; revert back
    pop ecx
Since various constants can be built using SSE operations in a hidden way, we can also obfuscate the making of the constant value:
    push ecx
    pcmpgtb xmm0, xmm0   ; ...00000000 (obfuscated pxor xmm0, xmm0)
    pcmpeqb xmm1, xmm1   ; ...FFFFFFFF
    punpcklbw xmm0, xmm1 ; ...FF00FF00
    movd ecx, xmm0       ; FF00FF00
    ...
Other examples include but are not limited to:
There were times when programmers used the FPU to speed up data movement: successive FLD, FSTP instructions moved up to 8 bytes at once. The 64-bit extensions make this obsolete, because now there are general registers of the same width. It's not only faster (latency-wise 5 vs 6), but also adds the ability to parallelize and/or rearrange the data (the FPU by default reverses data because of its FILO, stack-like, behaviour). SSE moves double the amount (16 bytes) in the exact same time.
I did some quick tests by adding some data-copying measurements to my program and outputting the results. The CPU crunching the code was a Core 2 Duo. The why-questions behind these results might be answered by Agner Fog's materials:
| Mode | Type of code | Clocks | Unrolled | Clocks | Notes |
|---|---|---|---|---|---|
| 32-bit | rep movsd | 640 | - | - | the fastest |
| | movsd | 6165 | - | - | |
| | mov [esi] > eax > [edi] | 2069 | ×4 | 1120 | |
| | fld [esi], fstp [edi] | 2069 | ×8 | 1549 | |
| | movd [esi] > xmm0 > [edi] | 2070 | ×2 | 1379 | |
| | | | ×4 | 1316 | |
| | | | ×8 | 1669 | unrolled too much |
| 64-bit | rep movsq | 628 | - | - | |
| | movsq | 3106 | - | - | |
| | mov [rsi] > rax > [rdi] | 1045 | ×2 | 563 | |
| | | | ×4 | 602 | unrolled too much |
| | fld [rsi], fstp [rdi] | 1045 | ×2 | 562 | |
| | | | ×4 | 628 | unrolled too much |
| | | | ×8 | 683 | unrolled too much |
| | movq [rsi] > xmm0 > [rdi] | 1046 | ×2 | 555 | |
| | | | ×4 | 602 | unrolled too much |
| | | | ×8 | 625 | unrolled too much |
| | | | ×16 | 738 | unrolled too much |
| | movaps [rsi] > xmm0 > [rdi] | 533 | ×2 | 298 | |
| | | | ×4 | 258 | the fastest |
| | | | ×8 | 326 | |
To make the message even clearer, I will point out that the generalization works both ways. It means that if, for example, you would like to copy an important byte, word or dword across the full register, you just have to IMUL r/m, const, where const is a repetitive pattern of 01h, 0001h or 00000001h. The result is a repetitive pattern of the source.
Example: Byte, word and dword broadcast:
    mov eax, 34h         ; 34h is the important byte
    imul eax, 01010101h  ; eax = 34343434h

    mov eax, 34h
    imul eax, 00010001h  ; eax = 00340034h

    mov rax, 34h
    imul rax, 100000001h ; rax = 0000003400000034h
SIMD addition or subtraction is also possible, but with double-sized types. It means that for every word you want to calculate, you need at least a dword of space, because the addition or subtraction might wrap over, and you need to catch that. Actually, within one calculation you only need one extra bit to remember the wrap-over, but it's easiest to implement on a full machine-type border like a word, dword or qword. An example: byte addition (signed or unsigned) is only possible with at least word alignment in a register.
Example: SIMD byte addition – 43h+FFh and FEh+5h done in parallel:
    mov eax, 004300FEh ; word data
    add eax, 00FF0005h ; eax = 01420103h
The result is 0142_0103h, but because we're dealing with bytes, we only consider 42_03h as the result.
Multiplies are a bit more limited and only possible with the same constant. Take for example two byte-sized integers packed in a dword, and another byte to multiply them with. One would look like 002300FFh and the other something like 35h. The multiplied result will look like this: 073F_34CBh. Multiplies always produce double the width, so again, like with the simpler arithmetic, byte calculations result in words. Interpreting the results is up to the programmer, but unsigned multiplication is the easiest to understand.
Example: SIMD word multiply – 14h×4h and FFh×4h done in parallel:
    mov eax, 001400FFh  ; word data
    imul eax, 00000004h ; eax = 0050_03FCh
64-bit extensions add more possibilities with wider registers.
Generalization between SSE and GPRs is possible, but needs some pondering. With some thinking, such generalization can even be more efficient than the conventional method. We can use all the potential the CPU has, making our algorithms faster, or at least impressing our colleagues. While the whole world is moving, why should we stand still?
- Shaping the future with Intel® Tera-Scale computing research
- Barcelona's 128-bit wide SSE execution
- Intel Software Developer's Manual, Volume 2A, Instruction Set Reference A-M
- Agner Fog's optimization manuals, every assembly programmer's bible
The author doesn't wish to publish his e-mail here.
| Date | Revision | Changes | Author |
|---|---|---|---|
| 2007-11-26 | 1.0 | First public version | Madis Kalme |
(the date format corresponds to ISO 8601)