Transcript Document

Compiler and Developing Tools

Yao-Yuan Chuang


Fortran and C Compilers

 Freeware: the GNU Fortran and C compilers (g77, g95, gcc, and gfortran) are popular. The Windows ports are MinGW and Cygwin.

Proprietary
 Portland Group (http://www.pgroup.com)
 Intel Compiler (http://www.intel.com)
 Absoft
 Lahey
Comparison of compilers (http://fortran2000.com/ArnaudRecipes/CompilerTricks.html)
Fortran under Linux (http://www.nikhef.nl/~templon/fortran.html)

Compiler

 A compiler acts on Fortran or C source code (*.f or *.c) and generates an assembler source file (*.s).  An assembler converts the assembler source into an object file (*.o).  A linker then combines the object files (*.o) and library files (*.a) into an executable file.

 If no executable filename is specified, it is called a.out by default.


C vs. Fortran

 Fortran: Introduction to Scientific Computing (http://pro3.chem.pitt.edu/richard/chem3400/)
 C/C++: Computational Physics (http://www.physics.ohio-state.edu/~ntg/780/computational_physics_resources.php)

Developing Tools

 Gnu Make (http://www.oreilly.com/catalog/make3/book/index.csp)
 Version control
   CVS (http://cvsbook.red-bean.com/)
   SVN (http://svnbook.red-bean.com/)
 Documentation
   Doxygen (http://www.stack.nl/~dimitri/doxygen/)
 Patch and tar
 Integrated Development Environment
 Gnu debugger (http://sourceware.org/gdb/current/onlinedocs/)

Option of Make and Syntax of Makefile

 Make reads a makefile and creates the target. Make has many options; use the 'man' command to check them. For example:
   -k  continue executing even after an error
   -n  print the commands without executing them
   -f  use the named makefile
 Syntax of a makefile
   A makefile consists of dependencies and rules.
   A dependency lists a target and its corresponding source files.
   A rule must start at the beginning of the line with a 'tab' (this is very important).

myapp: main.o 2.o 3.o

main.o: main.c a.h

2.o: 2.c a.h b.h

3.o: 3.c b.h c.h

all: myapp myapp.1


Make and Makefile

Makefile contains dependencies and rules:

myapp: main.o 2.o 3.o
        gcc -o myapp main.o 2.o 3.o

main.o: main.c a.h
        gcc -c main.c

2.o: 2.c a.h b.h
        gcc -c 2.c

3.o: 3.c b.h c.h
        gcc -c 3.c

clean:
        /bin/rm *.o myapp

Be very careful while using 'tab' and 'space'.

Makefile

all: myapp

# which compiler
CC = gcc

# where are include files kept
INCLUDE = .

# options for development
CFLAGS = -g -Wall -ansi
# options for release
# CFLAGS = -O -Wall -ansi

myapp: main.o 2.o 3.o
        $(CC) -o myapp main.o 2.o 3.o

main.o: main.c a.h
        $(CC) -I$(INCLUDE) $(CFLAGS) -c main.c

2.o: 2.c a.h b.h
        $(CC) -I$(INCLUDE) $(CFLAGS) -c 2.c

3.o: 3.c b.h c.h
        $(CC) -I$(INCLUDE) $(CFLAGS) -c 3.c

clean:
        /bin/rm myapp *.o

Use macros: CC, INCLUDE, CFLAGS
Internal macros:
$<: the current dependency
$*: the current dependency without its suffix
$@: the name of the current target

Makefile with Multiple Target

all: myapp

# which compiler
CC = gcc

# where to install
INSTDIR = /usr/local/bin

# where are include files kept
INCLUDE = .

# options for development
CFLAGS = -g -Wall -ansi
# options for release
# CFLAGS = -O -Wall -ansi

myapp: main.o 2.o 3.o
        $(CC) -o myapp main.o 2.o 3.o

main.o: main.c a.h
        $(CC) -I$(INCLUDE) $(CFLAGS) -c main.c

2.o: 2.c a.h b.h
        $(CC) -I$(INCLUDE) $(CFLAGS) -c 2.c

3.o: 3.c b.h c.h
        $(CC) -I$(INCLUDE) $(CFLAGS) -c 3.c


Makefile with Multiple Target

clean:
        /bin/rm *.o myapp

install: myapp
        @if [ -d $(INSTDIR) ]; \
        then \
                cp myapp $(INSTDIR); \
                chmod a+x $(INSTDIR)/myapp; \
                chmod og-w $(INSTDIR)/myapp; \
                echo "Installed in $(INSTDIR)"; \
        else \
                echo "Sorry, $(INSTDIR) does not exist"; \
        fi

Makefile with Multiple Target

clean:
        /bin/rm *.o myapp

install: myapp
        @if [ -d $(INSTDIR) ]; \
        then \
                cp myapp $(INSTDIR) && \
                chmod a+x $(INSTDIR)/myapp && \
                chmod og-w $(INSTDIR)/myapp && \
                echo "Installed in $(INSTDIR)"; \
        else \
                echo "Sorry, $(INSTDIR) does not exist"; \
        fi

Use && so that each command executes only after the previous command has executed successfully.


Suffix and Pattern

Change all the .cpp files to .o files:

.SUFFIXES: .cpp
.cpp.o:
        $(CC) -xc++ $(CFLAGS) -I$(INCLUDE) -c $<

Archive *.o files into lib.a (a library file):

.c.a:
        $(CC) -c $(CFLAGS) $<
        $(AR) $(ARFLAGS) $@ $*.o


Use Makefile to Manage Library

all: myapp

# which compiler
CC = gcc

# where to install
INSTDIR = /usr/local/bin

# where are the include files
INCLUDE = .

# options for development
CFLAGS = -g -Wall -ansi
# options for release
# CFLAGS = -O -Wall -ansi

# local libraries
MYLIB = mylib.a

myapp: main.o $(MYLIB)
        $(CC) -o myapp main.o $(MYLIB)

$(MYLIB): $(MYLIB)(2.o) $(MYLIB)(3.o)

main.o: main.c a.h
2.o: 2.c a.h b.h
3.o: 3.c b.h c.h

clean:
        -rm main.o 2.o 3.o $(MYLIB)

install: myapp
        @if [ -d $(INSTDIR) ]; \
        then \
                cp myapp $(INSTDIR); \
                chmod a+x $(INSTDIR)/myapp; \
                chmod og-w $(INSTDIR)/myapp; \
                echo "Installed in $(INSTDIR)"; \
        else \
                echo "Sorry, $(INSTDIR) does not exist"; \
        fi

Revision Control System (RCS)

% rcs -i important.c
% ls -l important.c important.c,v
% ci important.c                     # check in file
% co -l important.c                  # check out file
% rlog important.c                   # check the log
% co -r1.1 important.c               # back to release 1
% ci -r2 important.c                 # check in as release 2
% rcsdiff -r1.1 -r1.2 important.c    # check the update
% ident important.c                  # verify the release number

Source Code Control System (SCCS) works the same way.

Version Control System

 Since you might generate more than one version, it is necessary to "control" versions efficiently, especially when there is more than one developer.

 Software packages
   Bitkeeper
   Microsoft VSS
   RCS
   SCCS
   CVS
   SVN

Version Control System

Problem to avoid: the lock-modify-unlock scheme

Version Control System

The copy-modify-merge scheme

Using CVS

 Log in as super user and create the repository:
   % mkdir /usr/local/repository
   % chgrp users /usr/local/repository
   % chmod g+w /usr/local/repository
 Log in as a normal user and initialize the repository:
   % cvs -d /usr/local/repository init
   (export CVSROOT=/usr/local/repository to omit -d)
 Import a new project:
   % cvs import -m "Initial version of Simple Project" test ychuang start
 Check out:
   % cvs checkout test
 Check differences:
   % cvs diff
 Check in:
   % cvs commit
 Check differences between releases:
   % cvs rdiff -r1.1 test
 Update:
   % cvs update -Pd test

CVS

 Access CVS via the network (http://www.cvshome.org)
   % cvs login
 Graphical user interface: gCVS (http://wincvs.org)
 Similar tools:
   Bitkeeper (http://www.bitkeeper.com)
   SVN

Documentation

Good documentation helps future development of a software package.

Manual page
 The manual page is generated with "nroff", which is similar to the "groff" tool.

Programmer's guide
 Using the doxygen package helps keep the documentation consistent with the comments in the source code.

User's guide
 Helps the user to use the program.

Manual Page

 A manual page contains: header, name, synopsis, description, options, files, see also, bugs. Use the command 'groff -Tascii -man myapp.1' to process the manual page (or -Tps for PostScript output).

myapp.1:

.TH MYAPP 1
.SH NAME
myapp \- A simple demo
.SH SYNOPSIS
.B myapp
[\-option ...]
.SH DESCRIPTION
.PP
\fImyapp\fP is a complete application
.PP
It was written for demo
.SH OPTIONS
.PP
List of options
.TP
.BI \-option
If there was an option
.SH RESOURCES
.PP
Myapp uses no resource
.SH DIAGNOSTICS
The program ...
.SH SEE ALSO
...
.SH COPYRIGHT
myapp is copyrighted
.SH BUGS
There is no bug
.SH AUTHORS
CSSD


Distributing

 Patch
  Generated by "diff":
   % diff file1.c file2.c > diffs
   % patch file1.c diffs       (update)
   % patch -R file1.c diffs    (reverse)

file1.c:
   This is file one
   Line 2
   Line 3
   There is no line 4, this is line 5
   Line 6

file2.c:
   This is file two
   Line 2
   Line 3
   Line 4
   Line 5
   Line 6
   A new line 8

diffs:
   1c1
   < This is file one
   ---
   > This is file two
   4c4,5
   < There is no line 4, this is line 5
   ---
   > Line 4
   > Line 5
   5a7
   > A new line 8

Distributing

 tar (tape archive)
  Create a tarball:
   % tar cvf myapp-1.0.tar main.c 2.c 3.c *.h myapp.1 Makefile
  Compress the tarball to save space:
   % gzip myapp-1.0.tar
  Or use the compact form:
   % tar zcvf myapp-1.0.tgz main.c 2.c 3.c *.h myapp.1 Makefile
  To untar the file:
   % tar zxvf myapp-1.0.tgz


Distributing

 RPM (Red Hat Package Manager)
  Install an RPM:
   % rpm -Uhv mysql-server-3.32.54a-11.i386.rpm
  Query:
   % rpm -q mysql-server
  Remove:
   % rpm -e mysql-server
  To create an RPM, use the 'rpmbuild' command, which requires you to 1) gather the software and 2) create an RPM SPEC file.

Gather Software

 Gather the software as a tarball by adding a target to the Makefile:

all: myapp
CC = gcc
INCLUDE = .
CFLAGS = -g -Wall -ansi
MYLIB = mylib.a

myapp: main.o $(MYLIB)
        $(CC) -o myapp main.o $(MYLIB)

main.o: main.c a.h
2.o: 2.c a.h b.h
3.o: 3.c b.h c.h

clean:
        -rm main.o 2.o 3.o $(MYLIB)

dist: myapp-1.0.tar.gz

myapp-1.0.tar.gz: myapp myapp.1
        -rm -rf myapp-1.0
        mkdir myapp-1.0
        cp *.c *.h *.1 Makefile myapp-1.0
        tar zcvf $@ myapp-1.0

Run '% make dist' to generate the tarball. ($@ expands to the target name; the copied items come from the current directory.)

Gather Software

 Copy the file to /usr/src/redhat/SOURCES:
   % cp myapp-1.0.tgz /usr/src/redhat/SOURCES
 In /usr/src/redhat:
   BUILD    where the software is built
   RPMS     stores the binary RPM files
   SOURCES  stores the source code
   SPECS    stores the SPEC files
   SRPMS    stores the source RPM files

Create a RPM Spec File

# spec file for myapp
#
# the preamble:
Vendor:       NUK
Distribution: any
Name:         myapp
Version:      1.0
Release:      1
Packager:     [email protected]
License:      Copyright 2007 NUK-HPC
Group:
Provides:     goodness
Requires:     mysql >= 3.23


Create RPM Spec File

Buildroot: %{_tmppath}/%{name}-%{version}-root
Source:    %{name}-%{version}.tar.gz
Summary:   Trivial application

%description
Myapp Trivial Application

%prep
%setup -q

%build
make

Create RPM Spec File

%install
mkdir -p $RPM_BUILD_ROOT%{_bindir}
mkdir -p $RPM_BUILD_ROOT%{_mandir}
install -m755 myapp $RPM_BUILD_ROOT%{_bindir}/myapp
install -m755 myapp.1 $RPM_BUILD_ROOT%{_mandir}/myapp.1

%clean
rm -rf $RPM_BUILD_ROOT

%post
mail root -s "myapp installed - please register" < /dev/null

%files
%{_bindir}/myapp
%{_mandir}/myapp.1


Use Rpmbuild

 To generate the rpm files:
   % rpmbuild -ba myapp.spec
 Generated:
   myapp-1.0-1.i386.rpm
   myapp-1.0-1.src.rpm
 Options:
   -ba  build all
   -bb  build binary
   -bc  build compiled
   -bp  build prepared
   -bi  build and install check
   -bl  rpm listing
   -bs  build source rpm

Integrated Development Environment (IDE)

 Dev-C++ (http://csjava.occ.cccd.edu/~gilberts/devcpp5/)
 XWPE (http://www.identicalsoftware.com/xwpe)
 C-FORGE (http://www.codeforge.com)
 KDevelop (http://www.kdevelop.org)
 Kylix (http://www.borland.com/kylix)
 Eclipse and Photran (http://www.eclipse.org)
 SUN Studio (http://developers.sun.com/sunstudio/)

Debugging

Errors causing bugs

 Error in specification
 Error in design
 Error in implementation

Five Stages

 Testing: find the bug
 Stabilization: reproduce the bug
 Localization: identify the bug
 Correction: fix the bug
 Verification: check the fix

Debugging

 Code inspection
 Trial and error
 Debug instrumentation
 Controlled execution

Example

/* 1 */  typedef struct {
/* 2 */      char *data;
/* 3 */      int key;
/* 4 */  } item;
/* 5 */
/* 6 */  item array[] = {
/* 7 */      {"bill", 3},
/* 8 */      {"neil", 4},
/* 9 */      {"john", 2},
/* 10 */     {"rick", 5},
/* 11 */     {"alex", 1},
/* 12 */ };
/* 13 */
/* 14 */ sort(a, n)
/* 15 */ item *a;
/* 16 */ {
/* 17 */     int i = 0, j = 0;
/* 18 */     int s = 1;
/* 19 */
/* 20 */     for (; i < n && s != 0; i++) {
/* 21 */         s = 0;
/* 22 */         for (j = 0; j < n; j++) {
/* 23 */             if (a[j].key > a[j+1].key) {
/* 24 */                 item t = a[j];
/* 25 */                 a[j] = a[j+1];
/* 26 */                 a[j+1] = t;
/* 27 */                 s++;
/* 28 */             }
/* 29 */         }
/* 30 */         n--;
/* 31 */     }
/* 32 */ }
/* 33 */
/* 34 */ main()
/* 35 */ {
/* 36 */     sort(array, 5);
/* 37 */ }

Add Few Line to Print Info

/* 34 */ main()
/* 35 */ {
/* 36 */     int i;
/* 37 */     sort(array, 5);
/* 38 */     for (i = 0; i < 5; i++)
/* 39 */         printf("array[%d] = {%s, %d}\n",
/* 40 */                i, array[i].data, array[i].key);
/* 41 */ }

% cc debug2.c -o debug2
% ./debug2
array[0] = {john,2}
array[1] = {alex,1}
array[2] = {(null),-1}
array[3] = {bill,3}
array[4] = {neil,4}
(or a segmentation fault, or other machine-dependent output)

But we expected:
array[0] = {alex,1}
array[1] = {john,2}
array[2] = {bill,3}
array[3] = {neil,4}
array[4] = {rick,5}

What went wrong?


New Change

We change line 2 to:
/* 2 */ char data[4096];

% cc -o debug3 debug3.c
% ./debug3
Segmentation fault (core dumped)

Code inspection

Normally the compiler will generate warnings about errors in the code.

Instrumentation

Add extra code to collect information about the program's behavior during execution, for example printf calls that dump the current values of variables.

Instrumentation

 In C, we use the preprocessor to add debugging code, which can be activated during compilation (-DDEBUG):

#ifdef DEBUG
    printf("Variable x has value = %d\n", x);
#endif

#include <stdio.h>
#include <stdlib.h>

int main()
{
#ifdef DEBUG
    printf("Compiled: " __DATE__ " at " __TIME__ "\n");
    printf("This is line %d of file %s\n", __LINE__, __FILE__);
#endif
    printf("Hello World\n");
    exit(0);
}

Instrumentation

% cc -o cinfo -DDEBUG cinfo.c
% ./cinfo
Compiled: Mar 22 2007 at 23:00:00
This is line 7 of file cinfo.c
Hello World
%

__LINE__  line number
__FILE__  filename
__DATE__  Mmm dd yyyy
__TIME__  hh:mm:ss

Or we can define a global variable "debug" to avoid recompiling each time:

if (debug) {
    sprintf(msg, ...);
    write_debug(msg);
}

Controlled Execution

 We use the GNU debugger (gdb) to set breakpoints and dump information during execution:

% cc -g -o debug3 debug3.c
% gdb debug3
(gdb) help
(gdb) run          # execute debug3; it stops at the fault (line 23)
(gdb) backtrace    # find out where it went wrong
(gdb) print j      # dump variable j
(gdb) print a[3]   # dump a[3]
(gdb) list         # list the source code around the fault

Controlled Execution

 We change line 22 and store the result as debug4.c:

/* 22 */ for (j = 0; j < n-1; j++) {

% cc -g -o debug4 debug4.c
% ./debug4
Still wrong!

Now, we set a breakpoint:
% gdb debug4
(gdb) break 20
(gdb) run
(gdb) print array[0]      # print array[0]
(gdb) print array[0]@5    # print 5 elements
(gdb) cont                # continue
(gdb) display array[0]@5  # automatically print when the breakpoint is hit
(gdb) cont

Fix Program in GDB

(gdb) info display      # information on displays
(gdb) info break        # information on breakpoints
(gdb) disable break 1   # disable breakpoint 1
(gdb) disable display 1 # disable display 1
(gdb) break 30
(gdb) command 2
> set variable n = n+1
> cont
> end
(gdb) run

Graphical GDB

 DDD (http://www.gnu.org/software/ddd/)
   (http://www.linuxfocus.org/English/January1998/article20.html)
 KDbg (http://www.kdbg.org/download.php)
 GVD (https://libre.adacore.com/gvd/)
 XXgdb (http://www.ee.ryerson.ca:8080/~elf/xapps/Q-IV.html)

Other Debugger

 splint: removes errors in source code (http://www.splint.org)
 ctags: creates 'tags' (like an index) of a source code
 cxref: creates a reference table of a source code
 cflow: creates the tree structure of a source code

If a tool is not in your distribution, check http://rufus.w3.org/linux/RPM or google it.

prof/gprof: Execution Profiling

% gcc -pg ptest.c -g -o ptest
% ./ptest
% gprof ptest > ptest.out
% less ptest.out

From the profiling information we know the number of function calls and how much time is spent in each function; hence we can improve the most 'critical' step within the program for optimized performance.


Assertion

Instead of using the printf function for debugging, we can use the macro assert:

/* assert.c */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <assert.h>

double my_sqrt(double x)
{
    assert(x >= 0.0);
    return sqrt(x);
}

int main()
{
    printf("sqrt +2 = %g\n", my_sqrt(2.0));
    printf("sqrt -2 = %g\n", my_sqrt(-2.0));
    exit(0);
}

% cc -o assert assert.c -lm
% ./assert
sqrt +2 = 1.41421
assert: assert.c:7: my_sqrt: Assertion 'x >= 0.0' failed
Aborted
%
...
% cc -o assert -DNDEBUG assert.c -lm
% ./assert
sqrt +2 = 1.41421
Floating point exception (or sometimes sqrt -2 = nan)
%

(The assertions can also be removed by setting #define NDEBUG.)

Memory Leak

 Memory leak: once you 'malloc' memory but forget to 'free' it. It is hard to debug but very important to find out.
 Electric Fence by Bruce Perens is a useful tool:

/* efence.c */
#include <stdio.h>
#include <stdlib.h>

int main()
{
    char *ptr = (char *) malloc(1024);
    ptr[0] = 0;
    /* now write beyond the block */
    ptr[1024] = 0;
    exit(0);
}

% cc -o efence efence.c
% ./efence
Linking with Electric Fence:
% cc -o efence efence.c -lefence
% ./efence
Segmentation fault (core dumped)
% cc -g -o efence efence.c -lefence

Memory Leak

 valgrind is another tool; you don't need to recompile the code to use it (http://developer.kde.org/~sewardj):

/* checker.c */
#include <stdio.h>
#include <stdlib.h>

int main()
{
    char ch;
    char *ptr = (char *) malloc(1024);
    /* uninitialized read */
    ch = ptr[1024];
    /* now write beyond the block */
    ptr[1024] = 0;
    /* orphan the block */
    ptr = 0;
    exit(0);
}

% valgrind --leak-check=yes -v ./checker
% valgrind --help

Code Optimization

Yao-Yuan Chuang


Code Optimization

 A compiler is a program that reads a source program written in a high-level language and translates it into machine language.

 An optimizing compiler generates "optimized" machine code, which takes less time to run, occupies less memory, or both.

 Assembly language generated from the GNU C compiler: (http://linuxgazette.net/issue71/joshi.html)

Code Optimization

 Make your code run faster
 Make your code use less memory
 Make your code use less disk space
 Optimization always comes before parallelization

Optimization Strategy

Unoptimized code
 → Set up a reference case
 → Apply compiler optimization (-O1 to -O3, other options)
 → Include numerical libraries
 → Profile and identify bottlenecks
 → Apply optimization techniques
 → Check the results (repeat the optimization loop as needed)
Optimized code

CPU Insights

A CPU contains an instruction cache, registers (L1 cache), pipelines, load/store units, floating-point (FP), fixed-point (FX), FMA, vector, and other specialized units, backed by the L2/L3 caches and main memory.

Memory Access

Typical memory access costs:
 L1 cache: 1 ~ 4 cycles
 L2/L3 cache: 8 ~ 20 cycles
 TLB (Translation Look-aside Buffer), the list of the most recently accessed memory pages; a TLB miss costs 30 ~ 60 cycles
 RAM, with the PFT (Page Frame Table) listing the locations of memory pages
 Disk (paged-out memory pages): 8000 ~ 120000 cycles

Measuring Performance

 Measurement Guidelines
 Time Measurement
 Profiling Tools
 Hardware Counters

Measuring Guidelines

 Make sure you have full access to the processor
 Always check the correctness of the results
 Use all the tools available
 Watch out for overhead
 Compare to theoretical peak performance

Time Measurement

On Linux

% time hello

In Fortran 90:

CALL SYSTEM_CLOCK(count1, count_rate, count_max)
... calculation ...
CALL SYSTEM_CLOCK(count2, count_rate, count_max)

Or use the etime() function with the PGI compiler.

Components of computing time

 User time
The user time corresponds to the amount of time the instructions in your program take on the CPU.

 System time
Most scientific programs require the OS kernel to carry out certain tasks, such as I/O. While these tasks are carried out, your program is not occupying the CPU. The system time is a measure of the time your program spends waiting for kernel services.

 Elapsed time
The elapsed time corresponds to the wall-clock time, or real-world time, taken by the program.


Profiling Tools

% gcc -pg newtest.c -g -o newtest
% ./newtest
% gprof newtest > newtest.out
% less newtest.out

From the profiling information we know the number of function calls and how much time is spent in each function; hence we can improve the most 'critical' step within the program for optimized performance.

 Tells you the portion of time the program spends in each of the subroutines and/or functions
 Mostly useful when your program has a lot of subroutines and/or functions
 Use profiling at the beginning of the optimization process
 The PGI profiling tool is called pgprof, which requires compiling with -Mprof=func

Hardware Counters

 All modern processors have built-in event counters
 Processors may have several registers reserved for counters
 It is possible to start, stop, and reset counters
 A software API can be used to access the counters
 Using hardware counters is a must in optimization

Software API-PAPI

Performance Application Programming Interface
 A standardized API to access hardware counters
 Available on most systems: Linux, Windows NT, Solaris, ...
 Motivation:
   To provide a solid foundation for cross-platform performance analysis tools
   To present a set of standard definitions for performance metrics
   To provide a standardized API
   To be easy to use, well documented, and freely available
 Web site: http://icl.cs.utk.edu/projects/papi

Optimization Techniques

 Compiler Options
 Use Existing Libraries
 Numerical Instabilities
 FMA Units
 Vector Units
 Array Considerations
 Tips and Tricks

Compiler Options

 Substantial gains can easily be obtained by playing with compiler options.

 Optimization options are "a must". The first and second levels of optimization will rarely fail to give benefits.

 Optimization options can range from -O1 to -O5 with some compilers. -O3 to -O5 might lead to slower code, so try them independently on each subroutine.

 Always check your results when trying optimization options.

 Compiler options might include hardware specifics such as accessing vector units.


Compiler Options

 GNU C compiler gcc
   -O0 -O1 -O2 -O3 -finline-functions ...
 PGI Workstation compilers pgcc, pgf90, and pgf77
   -O0 -O1 -O2 -O3 ...
 Intel Fortran and C compilers ifc and icc
   -O0 -O1 -O2 -O3 -ip -xW -tpp7 ...

Existing Libraries

 Existing libraries are usually highly optimized
 Try several libraries and compare if possible
 Recompile libraries on the platform you are running on if you have the source
 Vendor libraries are usually well optimized for their platform
 Popular mathematical libraries: BLAS, LAPACK, ESSL, FFTW, MKL, ACML, ATLAS, GSL, ...
 Watch out for cross-language usage (calling Fortran from C or calling C from Fortran)

Numerical Instabilities

 Specific to each problem
 Could lead to much longer run time
 Could lead to wrong results
 Examine the mathematics of the solver
 Look for operations involving very large and very small numbers
 Be careful when using higher compiler optimization options

FMA units

Y = A*X + B

An FMA (fused multiply-add) unit computes both the multiplication and the addition of Y = A*X + B in one cycle.

Vector units

A 128-bit vector unit (as in the Pentium 4 and Opteron) applies the same operation (+, -, *) to several operands at once:
 32-bit precision: four elements per operation (4 single-precision FLOPs/cycle)
 64-bit precision: two elements per operation (2 double-precision FLOPs/cycle)

Array Considerations

In Fortran (column-major storage), the first index should vary fastest:

do j = 1,5
  do i = 1,5
    a(i,j) = ...
  enddo
enddo

(not the reverse order, with the i loop outermost)

In C/C++ (row-major storage), the last index should vary fastest:

for (i = 1; i <= 5; i++) {
  for (j = 1; j <= 5; j++) {
    a[i][j] = ...
  }
}

(not the reverse order, with the j loop outermost)

The loop order should match the memory layout: consecutive iterations of the inner loop should touch consecutive memory locations.

Tips and Tricks

Sparse Arrays

 Hard to optimize because accessing the array requires jumps through memory
 Minimize the memory jumps
 Carefully analyze the construction of the sparse array; pointer techniques help but can be confusing
 Lower your expectations

Minimize number of Operations

During optimization, the first thing to do is reduce the number of unnecessary operations performed by the CPU.

do k=1,10
  do j=1,5000
    do i=1,5000
      a(i,j,k)=3.0*m*d(k)+c(j)*23.1-b(i)
    enddo
  enddo
enddo

1250 million operations

do k=1,10
  dtmp(k)=3.0*m*d(k)
  do j=1,5000
    ctmp(j)=c(j)*23.1
    do i=1,5000
      a(i,j,k)=dtmp(k)+ctmp(j)-b(i)
    enddo
  enddo
enddo

500 million operations

Complex Numbers

Watch for operations on complex numbers whose imaginary or real part equals zero.

! general complex multiplication
complex*16 a(1000,1000), b
complex*16 c(1000,1000)
do j=1,1000
  do i=1,1000
    c(i,j) = a(i,j)*b
  enddo
enddo

6 million operations

! real part of a = 0: store only the imaginary part aI
real*8 aI(1000,1000)
complex*16 b, c(1000,1000)
do j=1,1000
  do i=1,1000
    c(i,j) = cmplx(-aI(i,j)*IMAG(b), aI(i,j)*REAL(b))
  enddo
enddo

2 million operations

Loop Overhead and Object

do j = 1,1000000
  do i = 1,1000000
    do k = 1,2
      a(i,j,k)=b(i,j)*c(k)
    enddo
  enddo
enddo

do j = 1,1000000
  do i = 1,1000000
    a(i,j,1)=b(i,j)*c(1)
    a(i,j,2)=b(i,j)*c(2)
  enddo
enddo

Object declarations: in an object-oriented language, AVOID object declarations within the innermost loops.

Function call Overhead

do k = 1,1000000
  do j = 1,1000000
    do i = 1,5000
      a(i,j,k) = fl(c(i),b(j),k)
    enddo
  enddo
enddo

real*8 function fl(x,y,m)
real*8 x,y
integer m
fl = x*m - y
end

do k = 1,1000000
  do j = 1,1000000
    do i = 1,5000
      a(i,j,k) = c(i)*k - b(j)
    enddo
  enddo
enddo

This can also be achieved with the compiler's inlining options. The compiler will then replace all function calls by a copy of the function code, sometimes leading to a very large binary executable.

% ifc -ip
% icc -ip
% gcc -finline-functions

Blocking

 Blocking is used to reduce cache and TLB misses in nested matrix operations. The idea is to process as much of the data brought into the cache as possible:

do i = 1,n
  do j = 1,n
    do k = 1,n
      C(i,j)=C(i,j)+A(i,k)*B(k,j)
    enddo
  enddo
enddo

do ib = 1,n,bsize
  do jb = 1,n,bsize
    do kb = 1,n,bsize
      do i = ib,min(n,ib+bsize-1)
        do j = jb,min(n,jb+bsize-1)
          do k = kb,min(n,kb+bsize-1)
            C(i,j)=C(i,j)+A(i,k)*B(k,j)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo

Loop Fusion

 The main advantage of loop fusion is the reduction of cache misses when the same array is used in both loops. It also reduces loop overhead and allows better control of multiple instructions in a single cycle, when the hardware allows it.

do i = 1,100000
  a = a + x(i) + 2.0*z(i)
enddo
do j = 1,100000
  v = 3.0*x(j) - 3.314159267
enddo

do i = 1,100000
  a = a + x(i) + 2.0*z(i)
  v = 3.0*x(i) - 3.314159267
enddo

Loop Unrolling

 The main advantage of loop unrolling is to reduce or eliminate data dependencies in loops. This is particularly useful when using a superscalar architecture.

do i = 1,1000
  a = a + x(i) * y(i)
enddo

2000 cycles

do i = 1,1000,4
  a = a + x(i)*y(i) + x(i+1)*y(i+1) + x(i+2)*y(i+2) + x(i+3)*y(i+3)
enddo

1250 cycles (2 FMAs or vector units of length 2)

Sum Reduction

 Sum reduction is another way of reducing or eliminating data dependencies in loops. It is more explicit than the loop unroll.

do i = 1,1000
  a = a + x(i) * y(i)
enddo

2000 cycles

do i = 1,1000,4
  a1 = a1 + x(i)*y(i) + x(i+1)*y(i+1)
  a2 = a2 + x(i+2)*y(i+2) + x(i+3)*y(i+3)
enddo
a = a1 + a2

751 cycles (2 FMAs or vector units of length 2)

Better Performance in Math

 Replace divisions by multiplications. Contrary to floating-point multiplications, additions, or subtractions, divisions are very costly in terms of clock cycles: 1 multiplication = 1 cycle, 1 division = 14 ~ 20 cycles.

 Use repeated multiplications for exponentials. An exponential is a function call; if the exponent is a small integer, the multiplication should be done manually.


Portland Group Compiler

 A comprehensive discussion of the Portland Group compiler optimization options is given in the PGI User's Guide, available at http://www.pgroup.com/doc.

 Information on how to use Portland Group compiler options can be obtained on the command line with
   % pgf77 -fastsse -help

 Detailed information on the optimizations and transformations (i.e. loop unrolling) carried out by the compiler is given by the -Minfo option. This is often useful when your code produces unexpected results.

Portland Group Compiler

 Important compiler optimization options for the Portland Group compiler include:
   -fast              includes "-O2 -Munroll -Mnoframe -Mlre"
   -fastsse           includes "-fast -Mvect=sse -Mcache_align"
   -Mipa=fast         enables inter-procedural analysis (IPA) and optimization
   -Mipa=fast,inline  enables IPA-based optimization and function inlining
   -Mpfo              enables profile and data feedback based optimizations
   -Minline           inlines functions and subroutines
   -Mconcur           tries to autoparallelize loops for SMP/dual-core systems
   -mcmodel=medium    enables data > 2GB on Opterons running 64-bit Linux
 A good start for your compilation needs is: -fastsse -Mipa=fast

Optimization Levels

With the Portland Group compiler the different optimization levels correspond to:

-O0  the level-zero flag specifies no optimization. The intermediate code is generated and used for the machine code.
-O1  the level-one flag specifies local optimizations, i.e. local to a basic block.
-O2  the level-two flag specifies global optimizations. These optimizations occur over all the basic blocks and the control-flow structure.
-O3  the level-three flag specifies aggressive global optimization. All level-one and level-two optimizations are also carried out.

-Munroll option

 The -Munroll compiler option unrolls loops. This has the effect of reducing the number of iterations in the loop by executing multiple instances of the loop statements in each iteration. For example:

do i = 1, 100
  z = z + a(i) * b(i)
enddo

do i = 1, 100, 2
  z = z + a(i) * b(i)
  z = z + a(i+1) * b(i+1)
enddo

 Loop unrolling reduces the overhead of maintaining the loop index, and permits better instruction scheduling (control of sending the instructions to the CPU).


-Mvect=sse option

 The Portland Group compiler can be used to vectorize code. Vectorization transforms loops to improve memory access performance (i.e. maximize the usage of the various memory components, such as registers and cache).

 SSE is an acronym for Streaming SIMD Extensions: a set of CPU instructions, first introduced with the Intel Pentium III and AMD Athlon, which allows the same operation to act on multiple data items concurrently.

 The use of this compiler option can double the execution speed of a code.


Intermediate Language

 The intermediate language used by the compiler is a language somewhere between the high-level language used by the programmer (i.e. Fortran, C), and the assembly language used by the machine.

 The intermediate language is easier for the compiler to manipulate than source code. It contains not only the algorithm specified in the source code, but also expressions for calculating the memory addresses (which can also be subject to optimization).

 The intermediate language makes it much easier for the compiler to optimize source code.


Intermediate Language - quadruples

 Calculations in intermediate languages are simplified into quadruples: arithmetic expressions are broken down into calculations involving only two operands and one operator. This makes sense when considering how CPUs carry out a calculation. The simplification is illustrated with the following expression

A = -B + C * D / E

which can be simplified into quadruples by using temporary variables:

T1 = D / E
T2 = C * T1
T3 = -B
A = T3 + T2

Basic Blocks

 A more realistic example of intermediate language is given by the following code:

do while (j .lt. n)
  k = k + j * 2
  m = j * 2
  j = j + 1
enddo

 This code can be broken down into three basic blocks. A basic block is a collection of statements used to define local variables in compiler optimization. A basic block begins with a statement that either follows a branch (e.g. an IF), or is itself the target of a branch. A basic block has only one entrance (the top) and one exit (the bottom).


Basic Block Flow Graph

A::  t1 := j
     t2 := n
     t3 := t1 .lt. t2
     jump (B) t3 TRUE
     jump (C)

B::  t4 := k
     t5 := j
     t6 := t5 * 2
     t7 := t4 + t6
     k  := t7
     t8 := j
     t9 := t8 * 2
     m  := t9
     t10 := j
     t11 := t10 + 1
     j  := t11
     jump (A) TRUE

Write Efficient C and Code Optimization

 Use unsigned integers instead of signed integers
 Combine division and remainder
 Use shifts and masks for division and remainder by powers of two
 Use switch instead of if ... else ...
 Unroll loops
 Use lookup tables
 (http://www.codeproject.com/cpp/C___Code_Optimization.asp)