Cool one: Fused multiply accumulate (FMA) Using FMA everywhere hurts performance.

Cool one: Fused multiply accumulate (FMA)
Using FMA everywhere hurts performance
// ... stuff ...
x[0] = y[0]; // 128b copy
x[1] = y[1]; // 128b copy
// ... stuff ...

“optimized” into:

// ... stuff ...
x = y; // 256b copy
// ... stuff ...
This may cause huge slowdowns on some chips
What?
Intel Pentium 3 (1999), AMD Athlon XP (2001): some 128-bit SIMD instructions; /arch:SSE; Visual C++ ?
Intel Pentium 4 (2001), AMD Athlon 64 (2003): 128-bit SIMD instructions; /arch:SSE2; Visual Studio .NET 2003
Intel Sandy Bridge (2011), AMD Bulldozer (2011): FP 256-bit SIMD instructions; /arch:AVX; Visual Studio 2010
Intel Haswell (2013), Future AMD Chip (?): 256-bit SIMD instructions; /arch:AVX2; Visual Studio 2013 Update 2 (optimization support)
New hotness!
#1

_mm_fmadd_ss, _mm_fmsub_ss,
_mm_fnmadd_ss, _mm_fnmsub_ss,
_mm_fmadd_sd, _mm_fmsub_sd,
_mm_fnmadd_sd, _mm_fnmsub_sd,
_mm_fmadd_ps, _mm_fmsub_ps,
_mm_fnmadd_ps, _mm_fnmsub_ps,
_mm_fmadd_pd, _mm_fmsub_pd,
_mm_fnmadd_pd, _mm_fnmsub_pd,
_mm256_fmadd_ps, _mm256_fmsub_ps,
_mm256_fnmadd_ps, _mm256_fnmsub_ps,
_mm256_fmadd_pd, _mm256_fmsub_pd,
_mm256_fnmadd_pd, _mm256_fnmsub_pd
/arch:AVX2
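A minimal sketch of what "fused" means, using the standard `std::fma` from `<cmath>` as a portable scalar stand-in for the intrinsics listed above (which operate on `__m128`/`__m256` values and need FMA-capable hardware). The function names are hypothetical. Whether the compiler emits a hardware FMA instruction here depends on the target and flags.

```cpp
#include <cmath>

// Unfused: the product is rounded, then the sum is rounded (two roundings),
// unless the compiler decides to contract the expression into an FMA itself.
double multiply_then_add(double a, double b, double c) {
    double product = a * b;
    return product + c;
}

// Fused: a * b + c is computed with a single rounding at the end.
double fused_multiply_add(double a, double b, double c) {
    return std::fma(a, b, c);
}
```

For most inputs both functions return the same value. They can differ in the last bit when the intermediate product is not exactly representable.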



Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles
[Diagram: res = A * B + C. Multiply then add: 5 cycles, then 3 cycles. Single FMA: 5 cycles.]
Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles
[Diagram: res = A * B + C * D. Two multiplies then an add: 5 cycles, then 3 cycles. Multiply then FMA: 5 cycles, then 5 cycles.]
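The latency arithmetic for the two-product case can be sketched as below. This is a simplified model, not a measurement: it assumes the slide's latencies and that two independent multiplies overlap completely. The function names are hypothetical.

```cpp
#include <algorithm>

constexpr int kMulLatency = 5;  // cycles, from the slide
constexpr int kAddLatency = 3;  // cycles, from the slide
constexpr int kFmaLatency = 5;  // cycles, from the slide

// res = A*B + C*D with two multiplies and one add.
// Assumption: the two independent multiplies run in parallel.
constexpr int latency_mul_mul_add() {
    return std::max(kMulLatency, kMulLatency) + kAddLatency;
}

// res = fma(A, B, C*D): the FMA cannot start until C*D has finished.
constexpr int latency_mul_then_fma() {
    return kMulLatency + kFmaLatency;
}

static_assert(latency_mul_mul_add() == 8, "multiply + add path");
static_assert(latency_mul_then_fma() == 10, "multiply + FMA path");
```

Under these assumptions the FMA version has the longer critical path (10 cycles against 8), which is the sense in which using FMA everywhere can hurt.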
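Mult = 5 cycles, Add = 3 cycles, FMA = 5 cycles
[Diagram: dot-product accumulation into dp over A[5] * B[5], A[6] * B[6], ... With separate multiplies (t1, t2, ...) and adds, each step on the dp chain is 3 cycles. With FMA, each step on the dp chain is 5 cycles.]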
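A sketch of the two dot-product formulations, with a rough model of the loop-carried dependency chain. The function names and the chain model are illustrative assumptions: the model uses the slide's latencies and assumes the multiplies in the first version fully overlap with the adds. A compiler may also contract the first version into FMAs on its own, depending on flags.

```cpp
#include <cmath>
#include <cstddef>

constexpr long kMulLatency = 5;  // cycles, from the slide
constexpr long kAddLatency = 3;  // cycles, from the slide
constexpr long kFmaLatency = 5;  // cycles, from the slide

// Separate multiply and add: the multiply does not depend on dp,
// so only the add sits on the loop-carried dependency chain.
float dot_mul_add(const float* a, const float* b, std::size_t n) {
    float dp = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        float t = a[i] * b[i];
        dp = dp + t;
    }
    return dp;
}

// FMA: every iteration waits for the previous FMA to produce dp.
float dot_fma(const float* a, const float* b, std::size_t n) {
    float dp = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        dp = std::fma(a[i], b[i], dp);
    }
    return dp;
}

// Rough chain length in cycles for n elements under the stated assumptions.
constexpr long chain_cycles_mul_add(long n) {
    return n == 0 ? 0 : kMulLatency + n * kAddLatency;
}
constexpr long chain_cycles_fma(long n) {
    return n * kFmaLatency;
}
```

For 1000 elements this model gives 3005 cycles for the multiply-and-add chain against 5000 for the FMA chain.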




#2
Highly optimized CPU code isn’t CPU code.

for (i=0; i<1000; i++)
    A[i] = B[i] + C[i];

autovec (128-bit):

for (i=0; i<1000; i+=4)
    xmm1 = vmovups B[i]
    xmm2 = vaddps xmm1, C[i]
    A[i] = vmovups xmm2

for (i=0; i<1000; i++)
    A[i] = B[i] + C[i];

autovec (256-bit):

for (i=0; i<1000; i+=8)
    ymm1 = vmovups B[i]
    ymm2 = vaddps ymm1, C[i]
    A[i] = vmovups ymm2
Version               CPU     Mem     Total    Speedup vs. previous row
32-bit float scalar   80 ms   20 ms   100 ms
128-bit SIMD          20 ms   20 ms   40 ms    2.5x
256-bit SIMD          10 ms   20 ms   30 ms    1.3x (memory bound)

Highly optimized CPU code isn’t CPU code.
Windows task manager won’t help you here
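The arithmetic behind these totals can be sketched as a tiny model. It assumes, as the slide's totals do, that CPU time and memory time simply add up. The type and function names are hypothetical.

```cpp
struct Timing {
    double cpu_ms;  // time spent computing
    double mem_ms;  // time spent waiting on memory
};

// Assumption: CPU and memory time do not overlap, so they add.
constexpr double total_ms(Timing t) {
    return t.cpu_ms + t.mem_ms;
}

constexpr double speedup(Timing before, Timing after) {
    return total_ms(before) / total_ms(after);
}

// Numbers taken from the slide.
constexpr Timing kScalar{80.0, 20.0};
constexpr Timing kSimd128{20.0, 20.0};
constexpr Timing kSimd256{10.0, 20.0};
```

Halving the CPU time from 20 ms to 10 ms only takes the total from 40 ms to 30 ms, because the 20 ms of memory time does not shrink with wider SIMD.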
#3

Courtesy of http://eigen.tuxfamily.org/



[Benchmark chart: measured times of 8.5 ms, 8.5 ms, 6.4 ms, and 10 ms, annotated “enh”, “yay”, and “this sucks”.]
struct MyData {
    Vector4D v1; // 4 floats
    Vector4D v2; // 4 floats
};
MyData x;
MyData y;

void func2() {
    // ... unrelated stuff ...
    func3();
    // ... unrelated stuff ...
    // Before: two 128-bit copies
    //   x.v1 = y.v1;
    //   x.v2 = y.v2;
    // After: one 256-bit copy
    x = y; // 256-bit copy
}

This caused the 60% slowdown on Haswell.




bugs
deathly
potholes
void func1() {
    for (int i = 0; i<10000; i++)
        func2();
}

void func2() {
    // ... unrelated stuff ...
    func3();
    // ... unrelated stuff ...
    x = y; // 256-bit copy
}

void func3() {
    // ... unrelated stuff ...
    ... = x.v1; // 128-bit load from x
}
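A compilable sketch of this pattern, useful as a starting point for profiling. The `Vector4D` definition, the `sink` variable, the initial values, and the `iterations` parameter are assumptions added to make the fragment self-contained. Whether the whole-struct copy becomes a single 256-bit store, and whether any slowdown shows up, depends on the compiler, its flags, and the CPU.

```cpp
struct Vector4D {
    float v[4];  // 4 floats = 128 bits (assumed layout)
};

struct MyData {
    Vector4D v1;  // 4 floats
    Vector4D v2;  // 4 floats
};

static MyData x{};
static MyData y{{{1.0f, 2.0f, 3.0f, 4.0f}}, {{5.0f, 6.0f, 7.0f, 8.0f}}};
static float sink = 0.0f;  // keeps the load in func3 observable

void func3() {
    Vector4D t = x.v1;  // 128-bit load from x
    sink += t.v[0];
}

void func2() {
    func3();
    x = y;  // whole-struct copy; may compile to one 256-bit copy
}

void func1(int iterations) {
    for (int i = 0; i < iterations; i++)
        func2();
}
```

Each call to `func2` reads the lower half of `x` and then overwrites all of `x`, so the load in the next iteration follows the wide store from the previous one.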
vmovups YMMWORD PTR [rbx], ymm0
mov rcx, QWORD PTR __$ArrayPad$[rsp]
xor rcx, rsp
call __security_check_cookie
add rsp, 80
; 00000050H
pop rbx
ret 0
push rbx
sub rsp, 80
; 00000050H
mov rax, QWORD PTR __security_cookie
xor rax, rsp
mov QWORD PTR __$ArrayPad$[rsp], rax
mov rbx, r8
mov r8, rdx
mov rdx, rcx
lea rcx, QWORD PTR $T1[rsp]
mov rax, rsp
mov QWORD PTR [rax+8], rbx
mov QWORD PTR [rax+16], rsi
push rdi
sub rsp, 144
; 00000090H
vmovaps XMMWORD PTR [rax-24], xmm6
vmovaps XMMWORD PTR [rax-40], xmm7
vmovaps XMMWORD PTR [rax-56], xmm8
mov rsi, r8
mov rdi, rdx
mov rbx, rcx
vmovaps XMMWORD PTR [rax-72], xmm9
vmovaps XMMWORD PTR [rax-88], xmm10
vmovaps XMMWORD PTR [rax-104], xmm11
vmovaps XMMWORD PTR [rax-120], xmm12
vmovdqu xmm12, XMMWORD PTR __xmm@0000000000000000
test cl, 15
je SHORT $LN14@run
lea rdx, OFFSET FLAT:??_C@_1FM@KGHGDLJC@
lea rcx, OFFSET FLAT:??_C@_1BIM@JPMPBING@
mov r8d, 78
; 0000004eH
call _wassert
$LN14@run:
vmovupd xmm11, XMMWORD PTR [rsi]
vmovupd xmm10, XMMWORD PTR [rsi+16]




The performance landscape is changing.
Get to know your profiler.
Recap
Intel Pentium 3 (1999), AMD Athlon XP (2001): some 128-bit SIMD instructions; /arch:SSE; Visual C++ ?
Intel Pentium 4 (2001), AMD Athlon 64 (2003): 128-bit SIMD instructions; /arch:SSE2; Visual Studio .NET 2003
Intel Sandy Bridge (2011), AMD Bulldozer (2011): FP 256-bit SIMD instructions; /arch:AVX; Visual Studio 2010
Intel Haswell (2013), Future AMD Chip (?): 256-bit SIMD instructions; /arch:AVX2; Visual Studio 2013 Update 2 (optimization support)
Profile your code