
Compilers and Development Tools
Yao-Yuan Chuang
Fortran and C Compilers
Freeware
- The GNU Fortran and C compilers (g77, g95, gcc, and gfortran) are popular.
- Windows ports are available through MinGW or Cygwin.

Proprietary
- Portland Group (http://www.pgroup.com)
- Intel Compiler (http://www.intel.com)
- Absoft
- Lahey

Comparison
(http://fortran2000.com/ArnaudRecipes/CompilerTricks.html)

Fortran under Linux
(http://www.nikhef.nl/~templon/fortran.html)
Compiler
- Compilers act on Fortran or C source code (*.f or *.c) and generate an assembler source file (*.s).
- An assembler converts the assembler source into an object file (*.o).
- A linker then combines the object files (*.o) and library files (*.a) into an executable file.
- If no executable filename is specified, the default is a.out.
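These stages can be driven one at a time with gcc; a minimal walk-through, with an illustrative file name:

% gcc -S hello.c          (compile only: produces assembler source hello.s)
% gcc -c hello.s          (assemble: produces object file hello.o)
% gcc hello.o -o hello    (link with the libraries: produces executable hello)
% gcc hello.c             (all stages at once; the executable defaults to a.out)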
C vs. Fortran
Fortran
Introduction to Scientific Computing
(http://pro3.chem.pitt.edu/richard/chem3400/)

C/C++
Computational Physics
(http://www.physics.ohio-state.edu/~ntg/780/computational_physics_resources.php)
Code Optimization
- A compiler is a program that reads a source program written in a high-level language and translates it into machine language.
- An optimizing compiler generates "optimized" machine code that takes less time to run, occupies less memory, or both.
- Assembly language generated from the GNU C compiler:
(http://linuxgazette.net/issue71/joshi.html)
Code Optimization
- Make your code run faster
- Make your code use less memory
- Make your code use less disk space
- Optimization always comes before parallelization
Optimization Strategy
The strategy forms a loop:
1. Start from the unoptimized code and set up a reference case.
2. Apply compiler optimization (-O1 to -O3, other options).
3. Include numerical libraries.
4. Profile and identify the bottleneck.
5. Apply optimization techniques.
6. Check the result against the reference case.
7. Repeat from step 4 (the optimization loop) until you have optimized code.
CPU Insights
[Figure: block diagram of a CPU core. An instruction cache feeds the pipeline, which dispatches to the functional units: fixed-point (FX) unit, floating-point (FP) unit, FMA unit, vector unit, load/store units, and specialized units. All units operate on the registers (L1 cache), backed by the L2/L3 cache and main memory.]
Memory Access
[Figure: memory hierarchy. CPU → L1 cache (1 ~ 4 cycles) → L2/L3 cache (8 ~ 20 cycles) → RAM (8000 ~ 120000 cycles) → disk. The TLB (Translation Look-aside Buffer) holds the list of most recently accessed memory pages; a TLB miss costs 30 ~ 60 cycles. The PFT (Page Frame Table) holds the list of memory page locations.]
Measuring Performance
- Measurement guidelines
- Time measurement
- Profiling tools
- Hardware counters
Measuring Guidelines
- Make sure you have full access to the processor.
- Always check the correctness of the results.
- Use all the tools available.
- Watch out for overhead.
- Compare to the theoretical peak performance (e.g. a 2 GHz core completing 2 double-precision FLOPs per cycle peaks at 4 GFLOPS).
Time Measurement
On Linux
% time hello

In Fortran 90
CALL SYSTEM_CLOCK(count1, count_rate, count_max)
... calculation ...
CALL SYSTEM_CLOCK(count2, count_rate, count_max)
! elapsed seconds = (count2 - count1) / REAL(count_rate)

Or use the etime() function with the PGI compiler.
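For C code, a sketch of the same measurement with the POSIX clock_gettime() call (the loop being timed is only a placeholder; on older glibc, link with -lrt):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t1, t2;
    double sum = 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t1);      /* wall-clock start */
    for (int i = 0; i < 10000000; i++)        /* placeholder calculation */
        sum += 0.5 * i;
    clock_gettime(CLOCK_MONOTONIC, &t2);      /* wall-clock stop */

    double elapsed = (t2.tv_sec - t1.tv_sec)
                   + (t2.tv_nsec - t1.tv_nsec) / 1e9;
    printf("elapsed: %f s (sum = %g)\n", elapsed, sum);
    return 0;
}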
Components of computing time
- User time
The user time is the amount of time your program's instructions take on the CPU.

- System time
Most scientific programs rely on the OS kernel to carry out certain tasks, such as I/O. While these tasks are carried out, your program is not occupying the CPU. The system time is a measure of the time your program spends waiting for kernel services.

- Elapsed time
The elapsed time is the wall-clock time, or real-world time, taken by the program.
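The time command reports all three components; a run might look like this (output format varies by shell, numbers purely illustrative):

% time ./hello
real    0m2.34s
user    0m1.98s
sys     0m0.21s

On an otherwise idle machine user + sys is close to real; a large gap points to I/O waits or competing processes.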
Profiling Tools
% gcc -pg newtest.c -g -o newtest
% ./newtest                   (writes profile data to gmon.out)
% gprof newtest > newtest.out
% less newtest.out

From the profiling information we learn how many times each function is called and how much time is spent in each, so we can improve the most "critical" steps in the program for optimized performance.
- Tells you the portion of time the program spends in each of its subroutines and/or functions.
- Most useful when your program has many subroutines and/or functions.
- Use profiling at the beginning of the optimization process.
- The PGI profiling tool is called pgprof; compile with -Mprof=func.
Hardware Counters
- All modern processors have built-in event counters.
- Processors may have several registers reserved for counters.
- It is possible to start, stop, and reset the counters.
- A software API can be used to access the counters.
- Using hardware counters is a must in optimization.
Software API-PAPI
- Performance Application Programming Interface
- A standardized API to access hardware counters
- Available on most systems: Linux, Windows NT, Solaris, ...
- Motivation
  - To provide a solid foundation for cross-platform performance analysis tools
  - To present a set of standard definitions for performance metrics
  - To provide a standardized API
  - To be easy to use, well documented, and freely available
- Web site: http://icl.cs.utk.edu/projects/papi
(A minimal usage sketch follows.)
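A minimal sketch of PAPI's low-level C API, counting floating-point operations around a placeholder loop (assumes PAPI is installed; link with -lpapi):

#include <stdio.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long count;
    double a = 0.0;

    /* initialize the library and build an event set */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_FP_OPS);    /* preset: floating-point ops */

    PAPI_start(eventset);
    for (int i = 0; i < 1000000; i++)         /* code region to measure */
        a += 0.5 * i;
    PAPI_stop(eventset, &count);              /* reads and stops the counter */

    printf("FP operations: %lld (a = %g)\n", count, a);
    return 0;
}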
Optimization Techniques
- Compiler options
- Use existing libraries
- Numerical instabilities
- FMA units
- Vector units
- Array considerations
- Tips and tricks
Compiler Options
- Substantial gains can easily be obtained by playing with compiler options.
- Optimization options are "a must": the first and second optimization levels almost always pay off.
- Optimization options can range from -O1 to -O5 with some compilers. -O3 to -O5 might lead to slower code, so try them independently on each subroutine.
- Always check your results when trying optimization options.
- Compiler options may include hardware specifics, such as accessing vector units.
Compiler Options
GNU C compiler (gcc)
-O0 -O1 -O2 -O3 -finline-functions ...

PGI Workstation compilers (pgcc, pgf90, and pgf77)
-O0 -O1 -O2 -O3 ...

Intel Fortran and C compilers (ifc and icc)
-O0 -O1 -O2 -O3 -ip -xW -tpp7 ...
Existing Libraries
- Existing libraries are usually highly optimized.
- Try several libraries and compare them if possible.
- Recompile libraries on the platform you are running on if you have the source.
- Vendor libraries are usually well optimized for their platform.
- Popular mathematical libraries: BLAS, LAPACK, ESSL, FFTW, MKL, ACML, ATLAS, GSL, ...
- Watch out for cross-language usage (calling Fortran from C or C from Fortran); see the sketch below.
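As an example of cross-language usage, here is a hedged sketch of calling the Fortran BLAS routine DDOT from C. It assumes the common (but not universal) convention that the Fortran compiler appends a trailing underscore to symbol names, and that a BLAS library is linked in (e.g. -lblas):

#include <stdio.h>

/* Fortran passes every argument by reference */
extern double ddot_(const int *n, const double *x, const int *incx,
                    const double *y, const int *incy);

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {4.0, 3.0, 2.0, 1.0};
    int n = 4, inc = 1;

    /* dot product: 1*4 + 2*3 + 3*2 + 4*1 = 20 */
    printf("ddot = %f\n", ddot_(&n, x, &inc, y, &inc));
    return 0;
}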
Numerical Instabilities
- Specific to each problem
- Can lead to much longer run times
- Can lead to wrong results
- Examine the mathematics of the solver
- Look for operations involving very large and very small numbers
- Be careful when using higher compiler optimization options
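A small C illustration of the large-plus-small hazard: near 1e16 the spacing between adjacent doubles exceeds 1, so an added 1.0 can vanish entirely.

#include <stdio.h>

int main(void)
{
    double big = 1e16;

    /* big + 1.0 rounds back to big, so the difference is 0, not 1 */
    printf("%g\n", (big + 1.0) - big);

    /* accumulating the small terms first preserves them: prints 2 */
    printf("%g\n", ((1.0 + 1.0) + big) - big);
    return 0;
}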
FMA units
Y = A*X + B
[Figure: an FMA (fused multiply-add) unit performs the multiply and the add of Y = A*X + B as one instruction, in one cycle.]
Vector units
[Figure: a vector unit applies a single operation (+, -, *) to all elements of its registers at once. At 32-bit precision a register holds four operands (x1 .. x4); at 64-bit precision it holds two.]
The 128-bit-long vector unit of the P4 and Opteron delivers:
- 4 single-precision FLOPs/cycle
- 2 double-precision FLOPs/cycle
Array Considerations
In Fortran (column-major: the first index is contiguous in memory)

! Strided access: the inner loop jumps across rows
do i = 1, 5
  do j = 1, 5
    a(i,j) = ...
  enddo
enddo

! Contiguous access: the inner loop follows the memory layout
do j = 1, 5
  do i = 1, 5
    a(i,j) = ...
  enddo
enddo

In C/C++ (row-major: the last index is contiguous in memory)

/* Strided access */
for (j = 1; j <= 5; j++) {
  for (i = 1; i <= 5; i++) {
    a[i][j] = ...;
  }
}

/* Contiguous access */
for (i = 1; i <= 5; i++) {
  for (j = 1; j <= 5; j++) {
    a[i][j] = ...;
  }
}

Corresponding memory representation: consecutive memory locations hold the elements whose fastest-varying index runs 1, 2, 3, 4, 5 while the other index stays fixed at 1, so the contiguous loop orders above walk memory in order.
Tips and Tricks
Sparse arrays
- Hard to optimize, because accessing them requires jumps through memory.
- Minimize the memory jumps.
- Carefully analyze the construction of the sparse array; pointer techniques help, but they can be confusing (a common layout is sketched below).
- Lower your expectations.
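One widely used pointer-based layout is compressed sparse row (CSR); this sketch (names illustrative) stores each row's nonzeros contiguously, so a matrix-vector product jumps only through the input vector:

typedef struct {
    int     n;       /* number of rows */
    int    *rowptr;  /* row i's nonzeros are entries rowptr[i] .. rowptr[i+1]-1 */
    int    *col;     /* column index of each nonzero */
    double *val;     /* value of each nonzero */
} csr;

void csr_matvec(const csr *A, const double *x, double *y)
{
    for (int i = 0; i < A->n; i++) {
        double s = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i+1]; k++)
            s += A->val[k] * x[A->col[k]];   /* the only indirect access */
        y[i] = s;
    }
}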
Minimize number of Operations
- The first thing to do during optimization is to reduce the number of unnecessary operations performed by the CPU.

do k = 1, 10
  do j = 1, 5000
    do i = 1, 5000
      a(i,j,k) = 3.0*m*d(k) + c(j)*23.1 - b(i)
    enddo
  enddo
enddo
! 1250 million operations

do k = 1, 10
  dtmp(k) = 3.0*m*d(k)
  do j = 1, 5000
    ctmp(j) = c(j)*23.1
    do i = 1, 5000
      a(i,j,k) = dtmp(k) + ctmp(j) - b(i)
    enddo
  enddo
enddo
! 500 million operations
Complex Numbers
- Watch for operations on complex numbers whose imaginary or real part equals zero.

! Real part of a = 0
complex*16 a(1000,1000), b
complex*16 c(1000,1000)
do j = 1, 1000
  do i = 1, 1000
    c(i,j) = a(i,j)*b
  enddo
enddo
! 6 million operations

! Store only the imaginary part of a
real*8 aI(1000,1000)
complex*16 b, c(1000,1000)
do j = 1, 1000
  do i = 1, 1000
    c(i,j) = dcmplx(-aI(i,j)*dimag(b), aI(i,j)*dble(b))
  enddo
enddo
! 2 million operations
Loop Overhead and Object
do j = 1, 1000000
  do i = 1, 1000000
    do k = 1, 2
      a(i,j,k) = b(i,j)*c(k)
    enddo
  enddo
enddo

! Unrolling the short inner loop removes its overhead:
do j = 1, 1000000
  do i = 1, 1000000
    a(i,j,1) = b(i,j)*c(1)
    a(i,j,2) = b(i,j)*c(2)
  enddo
enddo

Object declarations
- In object-oriented languages, AVOID object declarations inside the innermost loops.
Function call Overhead
do k = 1, 1000000
  do j = 1, 1000000
    do i = 1, 5000
      a(i,j,k) = fl(c(i), b(j), k)
    enddo
  enddo
enddo

function fl(x, y, m)
real*8 fl, x, y
integer m
fl = x*m - y
return
end

! Inlining the call by hand removes the overhead:
do k = 1, 1000000
  do j = 1, 1000000
    do i = 1, 5000
      a(i,j,k) = c(i)*k - b(j)
    enddo
  enddo
enddo

This can also be achieved with the compiler's inlining options. The compiler then replaces all function calls by a copy of the function code, sometimes leading to a very large binary executable.
% ifc -ip
% icc -ip
% gcc -finline-functions
Blocking
- Blocking is used to reduce cache and TLB misses in nested matrix operations. The idea is to process as much of the data brought into the cache as possible before it is evicted.

do i = 1, n
  do j = 1, n
    do k = 1, n
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo

! Blocked version: bsize is chosen so a block fits in cache
do ib = 1, n, bsize
  do jb = 1, n, bsize
    do kb = 1, n, bsize
      do i = ib, min(n, ib+bsize-1)
        do j = jb, min(n, jb+bsize-1)
          do k = kb, min(n, kb+bsize-1)
            C(i,j) = C(i,j) + A(i,k)*B(k,j)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo
Loop Fusion
- The main advantage of loop fusion is the reduction of cache misses when the same array is used in both loops. It also reduces loop overhead and allows better scheduling of multiple instructions in a single cycle, when the hardware allows it.

do i = 1, 100000
  a = a + x(i) + 2.0*z(i)
enddo
do j = 1, 100000
  v = 3.0*x(j) - 3.314159267
enddo

! Fused: x is brought through the cache only once
do i = 1, 100000
  a = a + x(i) + 2.0*z(i)
  v = 3.0*x(i) - 3.314159267
enddo
Loop Unrolling
- The main advantage of loop unrolling is to reduce or eliminate data dependencies in loops. This is particularly useful on a superscalar architecture.

do i = 1, 1000
  a = a + x(i)*y(i)
enddo
! 2000 cycles

do i = 1, 1000, 4
  a = a + x(i)*y(i) + x(i+1)*y(i+1) &
        + x(i+2)*y(i+2) + x(i+3)*y(i+3)
enddo
! 1250 cycles with 2 FMAs or vector units (length 2)
Sum Reduction
- Sum reduction is another way of reducing or eliminating data dependencies in loops. It is more explicit than loop unrolling.

do i = 1, 1000
  a = a + x(i)*y(i)
enddo
! 2000 cycles

do i = 1, 1000, 4
  a1 = a1 + x(i)*y(i) + x(i+1)*y(i+1)
  a2 = a2 + x(i+2)*y(i+2) + x(i+3)*y(i+3)
enddo
a = a1 + a2
! 751 cycles with 2 FMAs or vector units (length 2)
Better Performance in Math
- Replace divisions by multiplications.
Contrary to floating-point multiplications, additions, or subtractions, divisions are very costly in terms of clock cycles: 1 multiplication = 1 cycle, 1 division = 14 ~ 20 cycles.

- Use repeated multiplications for exponentiation.
Exponentiation is a function call; if the exponent is a small integer, the multiplication should be done manually. Both tricks are sketched below.
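A small C sketch of both replacements (function names illustrative; note that multiplying by a precomputed reciprocal can differ from direct division in the last bit):

/* replace n divisions by one division and n multiplications */
void scale(double *y, const double *x, int n, double c)
{
    double rc = 1.0 / c;
    for (int i = 0; i < n; i++)
        y[i] = x[i] * rc;        /* instead of x[i] / c */
}

/* small integer exponent: multiply manually instead of calling pow(x, 3.0) */
double cube(double x)
{
    return x * x * x;
}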
Portland Group Compiler
- A comprehensive discussion of the Portland Group compiler optimization options is given in the PGI User's Guide, available at http://www.pgroup.com/doc.
- Information on how to use Portland Group compiler options can be obtained on the command line with
% pgf77 -fastsse -help
- Detailed information on the optimizations and transformations (e.g. loop unrolling) carried out by the compiler is given by the -Minfo option. This is often useful when your code produces unexpected results.
Portland Group Compiler
- Important compiler optimization options for the Portland Group compiler include:
-fast             includes "-O2 -Munroll -Mnoframe -Mlre"
-fastsse          includes "-fast -Mvect=sse -Mcache_align"
-Mipa=fast        enables inter-procedural analysis (IPA) and optimization
-Mipa=fast,inline enables IPA-based optimization and function inlining
-Mpfi ... -Mpfo   enables profile- and data-feedback-based optimizations
-Minline          inlines functions and subroutines
-Mconcur          tries to auto-parallelize loops for SMP/dual-core systems
-mcmodel=medium   enables data > 2 GB on Opterons running 64-bit Linux
A good start for your compilation needs is: -fastsse -Mipa=fast
Optimization Levels
- With the Portland Group compiler the different optimization levels correspond to:
-O0  Level zero specifies no optimization. The intermediate code is generated and translated straight into machine code.
-O1  Level one specifies local optimizations, i.e. local to a basic block.
-O2  Level two specifies global optimizations. These optimizations occur over all the basic blocks and the control-flow structure.
-O3  Level three specifies aggressive global optimization. All level-one and level-two optimizations are also carried out.
-Munroll option
- The -Munroll compiler option unrolls loops. This has the effect of reducing the number of iterations in the loop by executing multiple instances of the loop statements in each iteration. For example:

do i = 1, 100
  z = z + a(i) * b(i)
enddo

becomes

do i = 1, 100, 2
  z = z + a(i) * b(i)
  z = z + a(i+1) * b(i+1)
enddo

Loop unrolling reduces the overhead of maintaining the loop index and permits better instruction scheduling (control of sending the instructions to the CPU).
-Mvect=sse option
- The Portland Group compiler can be used to vectorize code. Vectorization transforms loops to improve memory access performance (i.e. to maximize the usage of the various memory components, such as registers and cache).
- SSE is an acronym for Streaming SIMD Extensions: a set of CPU instructions, first introduced with the Intel Pentium III and AMD Athlon, that apply the same operation to multiple data items concurrently.
- The use of this compiler option can double the execution speed of a code.
Intermediate Language
- The intermediate language used by the compiler is a language somewhere between the high-level language used by the programmer (i.e. Fortran, C) and the assembly language used by the machine.
- The intermediate language is easier for the compiler to manipulate than source code. It contains not only the algorithm specified in the source code, but also expressions for calculating the memory addresses (which can themselves be subject to optimization).
- The intermediate language makes it much easier for the compiler to optimize source code.
Intermediate Language - quadruples
- Calculations in the intermediate language are simplified into quadruples: arithmetic expressions are broken down into calculations involving only two operands and one operator. This makes sense when considering how a CPU carries out a calculation. The simplification is illustrated with the following expression,
A = -B + C * D / E
which can be decomposed into quadruples by using temporary variables:
T1 = D / E
T2 = C * T1
T3 = -B
A = T3 + T2
Basic Blocks
- A more realistic example of intermediate language is given by the example code

do while (j .lt. n)
  k = k + j * 2
  m = j * 2
  j = j + 1
enddo

This code can be broken down into three basic blocks. A basic block is the unit of code over which local compiler optimizations are defined: it begins with a statement that either follows a branch (e.g. an IF) or is itself the target of a branch, and it has only one entrance (the top) and one exit (the bottom).
Basic Block Flow Graph
A:  t1 := j
    t2 := n
    t3 := t1 .lt. t2
    jump (B) t3        ! taken when t3 is TRUE
    jump (C)           ! exit the loop

B:  t4  := k
    t5  := j
    t6  := t5 * 2
    t7  := t4 + t6
    k   := t7
    t8  := j
    t9  := t8 * 2
    m   := t9
    t10 := j
    t11 := t10 + 1
    j   := t11
    jump (A)

C:  the code following the loop
Write Efficient C and Code Optimization

- Use unsigned integers instead of signed integers
- Combine division and remainder
- Use division and remainder by powers of two
- Use switch instead of if ... else ...
- Unroll loops
- Use lookup tables
- (http://www.codeproject.com/cpp/C___Code_Optimization.asp)

Two of these items are sketched below.
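Hedged C sketches of two items from the list above:

#include <stdlib.h>

/* Combining division and remainder: div() computes both at once,
   so the compiler can use a single divide. */
void split_seconds(int total, int *min, int *sec)
{
    div_t d = div(total, 60);
    *min = d.quot;
    *sec = d.rem;
}

/* Division and remainder by powers of two: for unsigned operands
   these reduce to a shift and a mask. */
unsigned quarter(unsigned x)   { return x >> 2; }   /* x / 4 */
unsigned mod_eight(unsigned x) { return x & 7u; }   /* x % 8 */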