Improving Loop-Level Parallelism
陈健
2002/11
Copyright © 2002 Intel Corporation
Agenda
 Introduction
 Who Cares?
 Definition
 Loop Dependence and Removal
 Dependency Identification Lab
 Summary
Introduction
 Loops must meet certain criteria…
–Iteration Independence
–Memory Disambiguation
–High Loop Count
–Etc…
Who Cares?
 Achieving true parallelism:
– OpenMP
– Auto Parallelization…
 Explicit instruction-level parallelism (ILP):
– Streaming SIMD (MMX, SSE, SSE2, …)
– Software Pipelining on Intel® Itanium™ Processor
– Remove dependencies for the Out-of-Order Core
– More instructions run in parallel on the Intel Itanium Processor
 Automatic compiler parallelization:
– High Level Optimizations
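As a minimal sketch of the OpenMP option above (the function and array names are placeholders, not from the deck, and MAX is assumed to be a compile-time constant as in the deck's examples), an OpenMP-enabled compiler splits the independent iterations across threads:

void copy_loop(int *a, int *b) {
    int J;
    /* Each iteration is independent, so the iteration
       space can be divided among threads. */
    #pragma omp parallel for
    for (J = 0; J < MAX; J++) {
        a[J] = b[J];
    }
}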
Definition
 Loop Independence: Iteration Y of a loop is independent of when or whether iteration X happens.

int a[MAX];
for (J=0; J<MAX; J++) {
    a[J] = b[J];
}
Legend
OpenMP: True Parallelism
SIMD: Vectorization
SWP: Software Pipelining
OOO: Out-of-Order Core
ILP: Instruction Level Parallelism
Green: benefits from the concept
Yellow: some benefit from the concept
Red: no benefit from the concept
Agenda
 Definition
 Who Cares?
 Loop Dependence and Removal
–Data Dependencies
–Removing Dependencies
–Data Ambiguity and the Compiler
 Dependency Removal Lab
 Summary
Flow Dependency
 Read After Write
 Cross-Iteration
Flow Dependence: variables written, then read in a different iteration.

for (J=1; J<MAX; J++) {
    A[J]=A[J-1];
}

First iterations:
A[1]=A[0];
A[2]=A[1];
Anti-Dependency
 Write After Read
 Cross-Iteration
Anti-Dependence: variables read, then written in a different iteration.

for (J=1; J<MAX; J++) {
    A[J]=A[J+1];
}

First iterations:
A[1]=A[2];
A[2]=A[3];
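One standard removal technique for this anti-dependence, sketched here with an invented temporary array T (not part of the deck): complete all the reads into a buffer first, which leaves two loops whose iterations are each fully independent.

int T[MAX];
/* All reads of A[J+1] finish before any write to A[J];
   assumes A has MAX+1 elements, as the original loop requires. */
for (J = 1; J < MAX; J++)
    T[J] = A[J+1];
for (J = 1; J < MAX; J++)
    A[J] = T[J];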
Output Dependency
 Write After Write
 Cross-Iteration
Output Dependence: variables written, then written again in a different iteration.

for (J=1; J<MAX; J++) {
    A[J]=B[J];
    A[J+1]=C[J];
}

First iterations:
A[1]=B[1];
A[2]=C[1];
A[2]=B[2];
A[3]=C[2];
Intra-Iteration Dependency
 Dependency within an iteration
 Hurts ILP
 May be automatically removed by the compiler

K = 1;
for (J=1; J<MAX; J++) {
    A[J]=A[J] + 1;
    B[K]=A[K] + 1;
    K = K + 2;
}

First iteration:
A[1] = A[1] + 1;
B[1] = A[1] + 1;
Remove Dependencies
 Best choice
 Requirement for true parallelism
 Not all dependencies can be removed

for (J=1; J<MAX; J++) {
    A[J]=A[J-1] + 1;
}

becomes:

for (J=1; J<MAX; J++) {
    A[J]=A[0] + J;
}
Increasing ILP Without Removing Dependencies
 Good: unroll the loop
 Make sure the compiler can't, or didn't, do this for you
 The compiler should not apply common subexpression elimination
 Note that for floating-point data, precision could be altered

for (J=1; J<MAX; J++) {
    A[J]=A[J-1] + B[J];
}

becomes:

for (J=1; J<MAX; J+=2) {
    A[J]=A[J-1] + B[J];
    A[J+1]=A[J-1] + (B[J] + B[J+1]);
}
Induction Variables
 Induction variables are incremented on each trip through the loop
 Fix by replacing increment expressions with a pure function of the loop index

i1 = 0;
i2 = 0;
for (J=0; J<MAX; J++) {
    i1 = i1 + 1;
    B[i1] = ...
    i2 = i2 + J;
    A[i2] = ...
}

becomes:

for (J=0; J<MAX; J++) {
    B[J+1] = ...
    A[(J*J + J)/2] = ...
}
Reductions
 Reductions collapse array data to scalar data via associative operations:

for (J=0; J<MAX; J++)
    sum = sum + c[J];

 Take advantage of associativity: compute partial sums or a local maximum in private storage
 Next, combine the partial results into the shared result, taking care to synchronize access (see the sketch below)
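A minimal sketch of the partial-sum approach using OpenMP's standard reduction clause (sum and c are the deck's names; sum's declaration is assumed): each thread accumulates into a private copy of sum, and the runtime combines the private copies into the shared result on exit, handling the synchronization.

sum = 0;
#pragma omp parallel for reduction(+:sum)
for (J = 0; J < MAX; J++) {
    sum = sum + c[J];   /* private per-thread partial sum */
}
/* Partial sums are combined into the shared sum here. */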
Data Ambiguity and the Compiler
 Are the loop iterations independent?
 The C++ compiler has no idea:

void func(int *a, int *b) {
    for (J=0; J<MAX; J++) {
        a[J] = b[J];
    }
}

 No chance for optimization: to run error-free, the compiler must assume that a and b overlap
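One way to resolve the ambiguity, sketched below, is C99's restrict qualifier, which promises the compiler that a and b never overlap (check your compiler's documentation for the exact spelling and switches it accepts):

/* restrict asserts no aliasing between a and b,
   so each iteration is provably independent. */
void func(int * restrict a, int * restrict b) {
    int J;
    for (J = 0; J < MAX; J++) {
        a[J] = b[J];
    }
}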
Function Calls
 Generally, function calls inhibit ILP:

for (J=0; J<MAX; J++) {
    compute(a[J], b[J]);
    a[J] = sin(b[J]);
}

 Exceptions:
– Transcendentals
– IPO-compiled calls
Function Calls with State
 Many routines maintain state across calls:
– Memory allocation
– Pseudo-random number generators
– I/O routines
– Graphics libraries
– Third-party libraries
 Parallel access to such routines is unsafe unless synchronized
 Check the documentation for specific functions to determine thread-safety (see the sketch below)
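As an illustration of the thread-safety point (this example is not from the deck): the C library's rand() keeps hidden state across calls, while POSIX rand_r() takes its state as an argument, so each thread can own its seed.

#include <stdlib.h>

/* rand() shares hidden state: unsafe in a parallel loop.
   rand_r() (POSIX) uses caller-owned state: safe as long as
   each thread passes its own seed. */
void fill_random(int *a, unsigned int seed) {
    int J;
    for (J = 0; J < MAX; J++)
        a[J] = rand_r(&seed);
}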
A Simple Test
1. Reverse the loop order and rerun in serial
2. If the results are unchanged, the loop is independent*

*Exception: loops with induction variables

for (J=0; J<MAX; J++) {
    <...>
    compute(J, ...);
    <...>
}

becomes:

for (J=MAX-1; J>=0; J--) {
    <...>
    compute(J, ...);
    <...>
}
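A concrete version of the test, with an invented loop body that contains a flow dependence (nothing below is from the deck): run the loop forward and reversed from identical inputs and compare the outputs; differing results expose the dependence.

#include <stdio.h>
#include <string.h>

#define MAX 8

/* Loop body with a flow dependence: A[J] depends on A[J-1]. */
static void compute(int J, int *A) {
    if (J > 0) A[J] = A[J-1] + 1;
}

int main(void) {
    int fwd[MAX] = {0}, rev[MAX] = {0};
    int J;
    for (J = 0; J < MAX; J++)    compute(J, fwd);  /* forward  */
    for (J = MAX-1; J >= 0; J--) compute(J, rev);  /* reversed */
    printf("%s\n", memcmp(fwd, rev, sizeof fwd) ? "dependent"
                                                : "likely independent");
    return 0;
}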
Summary
 Loop Independence: loop iterations are independent of each other
 Explained its importance
– ILP and Parallelism
 Identified common causes of loop dependence
– Flow Dependency, Anti-Dependency, Output Dependency
 Taught some methods of fixing loop dependence
 Reinforced concepts through the lab