Transcript Document
Compiler and Developing
Tools
Yao-Yuan Chuang
1
Fortran and C Compilers
Freeware
The GNU Fortran and C compilers (g77, g95, gcc,
and gfortran) are popular.
The Windows ports are MinGW and Cygwin.
Proprietary
Portland Group (http://www.pgroup.com)
Intel Compiler (http://www.intel.com)
Absoft
Lehay
Comparison
(http://fortran2000.com/ArnaudRecipes/CompilerTricks.html)
Fortran under Linux
(http://www.nikhef.nl/~templon/fortran.html)
2
Compiler
A compiler reads Fortran or C source code (*.f or *.c) and
generates an assembler source file (*.s).
An assembler converts the assembler source into an object
file (*.o).
A linker then combines the object files (*.o) and the library
files (*.a) into an executable file.
If no executable filename is specified, the default name is
a.out.
3
C vs. Fortran
Fortran
Introduction to Scientific Computing
(http://pro3.chem.pitt.edu/richard/chem3400/)
C/C++
Computational Physics
(http://www.physics.ohio-state.edu/~ntg/780/computational_physics_resources.php)
4
Code Optimization
Yao-Yuan Chuang
5
Code Optimization
A compiler is a program that reads a source program written
in a high-level language and translates it into machine
language.
An optimizing compiler generates "optimized" machine
code that takes less time to run, occupies less
memory, or both.
Assembly language generated from the GNU C compiler:
(http://linuxgazette.net/issue71/joshi.html)
6
Code Optimization
Make your code run faster
Make your code use less memory
Make your code use less disk space
Optimization always comes before Parallelization
7
Optimization Strategy
1. Start from the unoptimized code and set up a reference case
2. Apply compiler optimization (-O1 to -O3, other options)
3. Include numerical libraries
4. Profile the code and identify the bottleneck
5. Apply optimization techniques to the bottleneck
6. Check the result against the reference case
7. Repeat steps 4-6 (the optimization loop) until you have the
optimized code
8
CPU Insights
Schematic of a modern CPU (flattened from the slide diagram):
The instruction cache and pipeline feed the execution units:
FP unit, FX (fixed-point) unit, FMA unit, vector unit, and
specialized units, plus the load/store units.
The registers (L1 cache) sit closest to the execution units,
backed by the L2/L3 cache and main memory.
9
Memory Access
Memory hierarchy (flattened from the slide diagram), fastest to
slowest:
CPU
L1 cache: 1 ~ 4 cycles
L2/L3 cache: 8 ~ 20 cycles
TLB (Translation Look-aside Buffer): list of the most recently
accessed memory pages; a TLB miss costs 30 ~ 60 cycles
RAM (memory pages): 8000 ~ 120000 cycles
PFT (Page Frame Table): list of memory page locations
Disk
10
Measuring Performance
Measurement Guidelines
Time Measurement
Profiling tools
Hardware Counters
11
Measuring Guidelines
Make sure you have full access to the processor
Always check the correctness of the results
Use all the tools available
Watch out for overhead
Compare to theoretical peak performance
12
Time Measurement
On Linux
% time hello
In Fortran 90
CALL SYSTEM_CLOCK(count1,count_rate,count_max)
... calculation ...
CALL SYSTEM_CLOCK(count2,count_rate,count_max)
Or use the etime() function with the PGI compiler.
13
Components of computing time
User time
The user time is the amount of time the
instructions in your program take on the CPU.
System time
Most scientific programs require the OS kernel to
carry out certain tasks, such as I/O. While carrying out
these tasks, your program is not occupying the CPU. The
system time is a measure of the time your program spends
waiting for kernel services.
Elapsed time
The elapsed time corresponds to the wall-clock time, or
real-world time, taken by the program.
14
Profiling Tools
% gcc -pg newtest.c -g -o newtest
% ./newtest
% gprof newtest > newtest.out
% less newtest.out
From the profiling information, we know the number of function calls
and how much time is spent in each function; hence, we can
improve the most 'critical' step within the program for
optimized performance.
Tells you the portion of time the program spends in each of the
subroutines and/or functions.
Mostly useful when your program has a lot of
subroutines and/or functions.
Use profiling at the beginning of the optimization process.
The PGI profiling tool is called pgprof; compile with -Mprof=func.
15
Hardware Counters
All modern processors have built-in event counters.
Processors may have several registers reserved for
counters.
It is possible to start, stop, and reset the counters.
A software API can be used to access the counters.
Using hardware counters is a must in optimization.
16
Software API-PAPI
Performance Application Programming Interface
A standardized API to access hardware counters
Available on most systems: Linux, Windows NT, Solaris, …
Motivation
To provide solid foundation for cross platform
performance analysis tools
To present a set of standard definitions for performance
metrics
To provide a standardized API
To be easy to use, well documented, and freely available
Web site: http://icl.cs.utk.edu/projects/papi
17
Optimization Techniques
Compiler Options
Use Existing Libraries
Numerical Instabilities
FMA units
Vector units
Array Considerations
Tips and Tricks
18
Compiler Options
Substantial gains can easily be obtained by playing with
compiler options.
Optimization options are "a must". The first and second
levels of optimization almost always give some benefit.
Optimization options can range from -O1 to -O5 with some
compilers. -O3 to -O5 might lead to slower code, so try
them independently on each subroutine.
Always check your results when trying optimization options.
Compiler options might include hardware specifics such as
accessing vector units.
19
Compiler Options
GNU C compiler
gcc
-O0 –O1 –O2 –O3 –finline-functions …
PGI Workstation Compiler
pgcc, pgf90, and pgf77
-O0 –O1
-O2
-O3 …
Intel Fortran and C compiler
ifc and icc
-O0 –O1 –O2 –O3 –ip –xW –tpp7 …
20
Existing Libraries
Existing libraries are usually highly optimized.
Try several libraries and compare them if possible.
Recompile libraries on the platform you are running on if you
have the source.
Vendor libraries are usually well optimized for their
platform.
Popular mathematical libraries: BLAS, LAPACK, ESSL, FFTW,
MKL, ACML, ATLAS, GSL, ...
Watch out for cross-language usage (calling Fortran from C or
calling C from Fortran).
21
Numerical Instabilities
Specific to each problem
Could lead to a much longer run time
Could lead to wrong results
Examine the mathematics of the solver
Look for operations involving very large and very small
numbers
Be careful when using higher compiler optimization options
22
FMA units
Y = A*X + B
An FMA (fused multiply-add) unit computes Y = A*X + B in a
single cycle: the multiply and the add are issued as one
instruction.
23
Vector units
A vector unit applies a single operation (+, -, *) to several
operands at once.
32-bit precision: one 128-bit vector holds four single-precision
values (x1, x2, x3, x4), so one instruction produces four results.
64-bit precision: one 128-bit vector holds two double-precision
values, so one instruction produces two results.
The 128-bit-long vector units on the P4 and Opteron give:
4 single-precision FLOPs/cycle
2 double-precision FLOPs/cycle
24
Array Considerations
In Fortran (column-major order):
! slow: the inner loop strides across columns
do i=1,5
  do j=1,5
    a(i,j)= ...
  enddo
enddo
! fast: the inner loop walks down a column contiguously
do j=1,5
  do i=1,5
    a(i,j)= ...
  enddo
enddo
In C/C++ (row-major order):
/* slow: the inner loop jumps between rows */
for(j=0;j<5;j++){
  for(i=0;i<5;i++){
    a[i][j]= ...
  }
}
/* fast: the inner loop walks along a row contiguously */
for(i=0;i<5;i++){
  for(j=0;j<5;j++){
    a[i][j]= ...
  }
}
Corresponding memory representation (indices touched by
successive inner-loop iterations):
Outer: 1 1 1 1 1
Inner: 1 2 3 4 5
With the correct loop order these successive elements sit at
consecutive memory locations (stride 1); with the wrong order
each step jumps by the length of a whole row or column.
25
Tips and Tricks
Sparse Arrays
Hard to optimize because accessing them requires jumps
through memory.
Minimize the memory jumps.
Carefully analyze the construction of the sparse array;
pointer techniques help, but they can be confusing.
Lower your expectations.
26
Minimize number of Operations
During optimization, the first thing to do is reduce the
number of unnecessary operations performed by the CPU.
do k=1,10
do j=1,5000
do i=1,5000
a(i,j,k)=3.0*m*d(k)+c(j)*23.1-b(i)
enddo
enddo
enddo
do k=1,10
dtmp(k)=3.0*m*d(k)
do j=1,5000
ctmp(j)=c(j)*23.1
do i=1,5000
a(i,j,k)=dtmp(k)+ctmp(j)-b(i)
enddo
enddo
enddo
First version: ~1250 million operations.
Second version: ~500 million operations.
27
Complex Numbers
Watch for operations on complex numbers whose imaginary
or real part equals zero.
! Real part = 0
complex *16 a(1000,1000),b
complex *16 c(1000,1000)
do j=1,1000
do i=1,1000
c(i,j) = a(i,j)*b
enddo
enddo
~6 million operations
! store only the imaginary part aI of the purely imaginary a
real *8 aI(1000,1000)
complex *16 b,c(1000,1000)
do j=1,1000
  do i=1,1000
    c(i,j) = DCMPLX(-DIMAG(b)*aI(i,j), aI(i,j)*DBLE(b))
  enddo
enddo
~2 million operations
28
Loop Overhead and Object
do j = 1,1000000
do i = 1,1000000
do k = 1,2
a(i,j,k)=b(i,j)*c(k)
enddo
enddo
enddo
do j = 1,1000000
  do i = 1,1000000
    a(i,j,1)=b(i,j)*c(1)
    a(i,j,2)=b(i,j)*c(2)
  enddo
enddo
Object declarations
In object-oriented languages, AVOID declaring objects
inside the innermost loops.
29
Function call Overhead
do k = 1,1000000
  do j = 1,1000000
    do i = 1,5000
      a(i,j,k)=fl(c(i),b(j),k)
    enddo
  enddo
enddo
function fl(x,y,m)
real*8 x,y,fl
integer m
fl=x*m-y
return
end
do k = 1,1000000
do j = 1,1000000
do i = 1,5000
a(i,j,k)=c(i)*k-b(j)
enddo
enddo
enddo
This can also be achieved with the compiler's
inlining options. The compiler then
replaces all function calls with a copy of
the function code, sometimes leading
to a very large binary executable.
% ifc -ip
% icc -ip
% gcc -finline-functions
30
Blocking
Blocking is used to reduce cache and TLB misses in nested
matrix operations. The idea is to process as much of the data
brought into the cache as possible.
do i = 1,n
  do j = 1,n
    do k = 1,n
      C(i,j)=C(i,j)+A(i,k)*B(k,j)
    enddo
  enddo
enddo
do ib = 1,n,bsize
  do jb = 1,n,bsize
    do kb = 1,n,bsize
      do i = ib,min(n,ib+bsize-1)
        do j = jb,min(n,jb+bsize-1)
          do k = kb,min(n,kb+bsize-1)
            C(i,j)=C(i,j)+A(i,k)*B(k,j)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo
31
Loop Fusion
The main advantage of loop fusion is the reduction of cache
misses when the same array is used in both loops. It also
reduces loop overhead and allows better control of multiple
instructions in a single cycle, when the hardware allows it.
Before fusion:
do i = 1,100000
  a = a + x(i) + 2.0*z(i)
enddo
do j = 1,100000
  v = 3.0*x(j) - 3.314159267
enddo
After fusion:
do i = 1,100000
  a = a + x(i) + 2.0*z(i)
  v = 3.0*x(i) - 3.314159267
enddo
32
Loop Unrolling
The main advantage of loop unrolling is to reduce or eliminate
data dependencies in loops. This is particularly useful when
using a superscalar architecture.
Original (~2000 cycles):
do i = 1,1000
  a = a + x(i) * y(i)
enddo
Unrolled by 4 (~1250 cycles with 2 FMAs
or vector units of length 2):
do i = 1,1000,4
  a = a + x(i)   * y(i)
        + x(i+1) * y(i+1)
        + x(i+2) * y(i+2)
        + x(i+3) * y(i+3)
enddo
33
Sum Reduction
Sum reduction is another way of reducing or eliminating data
dependencies in loops. It is more explicit than the loop unroll.
Original (~2000 cycles):
do i = 1,1000
  a = a + x(i) * y(i)
enddo
With two partial sums (~751 cycles with 2 FMAs
or vector units of length 2):
do i = 1,1000,4
  a1 = a1 + x(i)   * y(i)
          + x(i+1) * y(i+1)
  a2 = a2 + x(i+2) * y(i+2)
          + x(i+3) * y(i+3)
enddo
a = a1 + a2
34
Better Performance in Math
Replace divisions by multiplications
Unlike floating-point multiplications, additions, and
subtractions, divisions are very costly in terms of clock cycles:
1 multiplication = 1 cycle, 1 division = 14 ~ 20 cycles.
Use repeated multiplication for small exponents
Exponentiation is a function call; if the exponent is a small
integer, do the multiplication manually.
35
Portland Group Compiler
A comprehensive discussion of the Portland Group compiler
optimization options is given in the PGI User’s Guide, available
at http://www.pgroup.com/doc.
Information on how to use Portland Group compiler options
can be obtained on the command line with
% pgf77 –fastsse –help
Detailed information on the optimization and transformation
(i.e. loop unrolling) carried out by the compiler is given by
the –Minfo option. This is often useful when your code
produces unexpected results.
36
Portland Group Compiler
Important compiler optimization options for the Portland Group
compiler include:
-fast
includes “-O2 –Munroll –Mnoframe –Mlre”
-fastsse
includes “-fast –Mvec=sse –Mcache_align”
-Mipa=fast
enables inter-procedural analysis (IPA) and optimization
-Mipa=fast,inline
enables IPA-based optimization and function inlining
-Mpft … -Mpfo
enables profile and data feedback based optimizations
-Minline
inline functions and subroutines
-Mconcur
try to autoparallelize loops for SMP/dual core systems
-mcmodel=medium
enables data > 2GB on Opterons running 64-bit Linux
A good start for your compilation needs is: -fastsse –Mipa=fast
37
Optimization Levels
With the Portland Group compiler the different optimization levels
correspond to:
-O0
the level-zero flag specifies no optimization. The
intermediate code is generated and translated directly
into machine code.
-O1
the level-one specifies local optimizations, i.e. local
to a basic block
-O2
the level-two specifies global optimizations. These
optimizations occur over all the basic blocks, and the
control-flow structure.
-O3
level-three specifies an aggressive global
optimization. All level-one and level-two
optimizations are also carried out.
38
-Munroll option
The -Munroll compiler option unrolls loops. This has the effect
of reducing the number of iterations in the loop by executing
multiple instances of the loop statements in each iteration.
For example:
do i = 1, 100
z = z + a(i) * b(i)
enddo
do i = 1, 100, 2
z = z + a(i) * b(i)
z = z + a(i+1) * b(i+1)
enddo
Loop unrolling reduces the overhead of maintaining the loop
index, and permits better instruction scheduling (control of
sending the instructions to the CPU).
39
-Mvect=sse option
The Portland Group compiler can be used to vectorize code.
Vectorization transforms loops to improve memory access
performance (i.e. maximize the usage of the various memory
components, such as registers, cache).
SSE is an acronym for Streaming SIMD Extensions, and is a
set of CPU instructions, first introduced with the Intel Pentium
III and AMD Athlon, which allows for the same operation
acting on multiple data items concurrently.
The use of this compiler option can double the execution speed
of a code.
40
Intermediate Language
The intermediate language used by the compiler is a language
somewhere between the high-level language used by the
programmer (i.e. Fortran, C), and the assembly language used
by the machine.
The intermediate language is easier for the compiler to
manipulate than source code. It contains not only the
algorithm specified in the source code, but expressions for
calculating the memory addresses (which can also be subject
to optimization).
The intermediate language makes it much easier for the
compiler to optimize source code.
41
Intermediate Language - quadruples
Calculations in intermediate languages are simplified into
quadruples. Arithmetic expressions are broken down into
calculations involving only two operands and one operator.
This makes sense when considering how CPUs carry out a
calculation. This simplification is illustrated with the
following expression
A = -B + C * D / E
which can be simplified into quadruples by using temporary
variables:
T1 = D / E
T2 = C * T1
T3 = -B
A = T3 + T2
42
Basic Blocks
A more realistic example of intermediate language is given by
using the example code
do while (j .lt. n)
  k = k + j * 2
  m = j * 2
  j = j + 1
enddo
This code can be broken down into three basic blocks of code.
A basic block is a collection of statements used to define local
variables in compiler optimization. A basic block begins with a
statement that either follows a branch (e.g. an IF), or is itself
the target of a branch. A basic block has only one entrance
(the top), and one exit (the bottom).
43
Basic Block Flow Graph
A:: t1 := j
    t2 := n
    t3 := t1 .lt. t2
    jump(B) t3        ! taken when t3 is TRUE
    jump(C)
B:: t4 := k
    t5 := j
    t6 := t5 * 2
    t7 := t4 + t6
    k  := t7
    t8 := j
    t9 := t8 * 2
    m  := t9
    t10 := j
    t11 := t10 + 1
    j  := t11
    jump(A)
44
Write Efficient C and Code
Optimization
Use unsigned int instead of int where possible
Combine division and remainder
Division and remainder by powers of two
Use switch instead of if ... else ...
Loop unrolling
Use lookup tables
(http://www.codeproject.com/cpp/C___Code_Optimization.asp)
45