Transcript F2C-ACC

First experiences with porting COSMO code to GPU using the F2C-ACC Fortran-to-CUDA compiler
Cristiano Padrin (CASPUR), Piero Lanucara (CASPUR), Alessandro Cheloni (CNMCA)
The GPU explosion

A huge amount of computing power: peak performance growing exponentially and outpacing “standard” multicore CPUs
Jazz Fermi GPU Cluster at CASPUR
785 MFlops/W
192 cores Intel [email protected] GHz
14336 cores on 32 Fermi C2050
QDR IB interconnect
1 TB RAM
200 TB IB storage
14.3 Tflops peak
CASPUR has been awarded as a CUDA Research Center for 2010-2011
Jazz cluster is currently number 5 on the Little Green List
10.1 Tflops Linpack
Introduction

The problem: porting large, legacy Fortran applications to GPGPU architectures.

CUDA is the de-facto standard, but only for C/C++ codes.

There is no standard yet: several GPU Fortran compilers exist, both commercial (CAPS HMPP, PGI Accelerator and CUDA Fortran) and freely available (F2C-ACC), ...

Our choice: F2C-ACC (Govett), a directive-based compiler from NOAA (a minimal directive sketch follows below).
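
To give a flavour of the directive-based approach, here is a minimal sketch of a Fortran loop nest annotated for F2C-ACC. The subroutine, array names and bounds are invented for illustration, and the ACC$REGION / ACC$DO spellings follow the F2C-ACC documentation we worked from; the exact syntax accepted by the installed release should be checked.

  subroutine saxpy_like(nk, ni, a, x, y)
    ! Illustrative only: F2C-ACC translates the marked region into a CUDA
    ! kernel; the outer loop is mapped to thread blocks (PARALLEL), the
    ! inner one to the threads of a block (VECTOR).
    implicit none
    integer, intent(in)    :: nk, ni
    real,    intent(in)    :: a, x(nk,ni)
    real,    intent(inout) :: y(nk,ni)
    integer :: i, k

  !ACC$REGION(<nk>,<ni>) BEGIN
  !ACC$DO PARALLEL(1)
    do i = 1, ni
  !ACC$DO VECTOR(1)
       do k = 1, nk
          y(k,i) = y(k,i) + a * x(k,i)
       end do
    end do
  !ACC$REGION END
  end subroutine saxpy_like

Without F2C-ACC the directives are plain Fortran comments, so the same source still compiles and runs on the CPU.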
How F2C-ACC participates in make

Build pipeline: filename.f90 -> (F2C-ACC) -> filename.m4 -> (m4) -> filename.cu -> (nvcc) -> filename.o

The corresponding make rules:
$(F2C) $(F2COPT) filename.f90
$(M4) filename.m4 > filename.cu
$(NVCC) -c $(NVCC_OPT) -I$(INCLUDE) filename.cu
F2C-ACC Workflow
 F2C-ACC translates Fortran code, with user-added directives, into CUDA (it relies on m4 for the inter-language dependencies)
 Some hand coding may be needed (see results)
 Debugging and optimization (e.g. thread/block synchronization, out-of-memory errors, coalescing, occupancy, ...) must be done manually
 Compiling and linking against the CUDA libraries produces the executable to run
Himeno Benchmark
Himeno Benchmark: MPI version
1 Process - 1 GPU
2 Process
512 x 256 x 128
4 Process
512 x 256 x 64
8 Process
512 x 256 x 32
16 Process
512 x 256 x 16
Porting the Microphysics

In POMPA task 6 we are exploring “the possibilities of a simple porting of specific physics or dynamics kernels to GPUs”.

During the last workshop in Manno at CSCS, two different approaches emerged to deal with the problem: one based on PGI Accelerator directives and the other based on the F2C-ACC tool.

The study was done on the stand-alone Microphysics program optimized for GPU with PGI by Xavier Lapillonne and available on HPCforge.
Reference Code Structure
In the Microphysics program, the two nested do-loops over space inside the subroutine hydci_pp were identified as the part to be accelerated via PGI directives.
[Code-structure diagram showing the files, modules and subroutines of the stand-alone program: FILE mo_gscp_dwd.f90 with MODULE mo_gscp_dwd, Subr. HYDCI_PP_INIT, Subr. HYDCI_PP, the elemental functions and Subr. SATAD, plus the main program and further files/modules; the part accelerated via PGI directives lies inside HYDCI_PP.]
Reference Code Structure
Simplified workflow of HYDCI_PP:
  presettings
  2 nested do-loops over "i and k" (the accelerated part, see the sketch below):
    computing
    update of global output
    "satad" of some globals
    ...
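
As an illustration, here is a much-simplified sketch of such an accelerated region, assuming the pre-OpenACC PGI Accelerator directives (!$acc region / !$acc end region); array names, loop order and the loop body are invented and only stand in for the real hydci_pp code.

  subroutine microphys_sketch(ni, nk, zdt, qr, ztend)
    ! Illustrative only: the real hydci_pp updates many more fields.
    implicit none
    integer, intent(in)    :: ni, nk
    real,    intent(in)    :: zdt, ztend(ni,nk)
    real,    intent(inout) :: qr(ni,nk)
    integer :: i, k

  !$acc region                ! PGI Accelerator model (pre-OpenACC)
    do k = 1, nk              ! loop over vertical levels
       do i = 1, ni           ! loop over horizontal points
          ! presettings, microphysical computing, update of global output,
          ! saturation adjustment ("satad") of some global fields ...
          qr(i,k) = max(0.0, qr(i,k) + zdt * ztend(i,k))
       end do
    end do
  !$acc end region
  end subroutine microphys_sketch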
Modified Code Structure

We proceeded to accelerate the same part of the code via F2C-ACC directives.

Due to limitations of the current F2C-ACC release, the code structure has been partly modified, while the workflow has been left unchanged.

The part of the code to be accelerated remains the same, but it has been extracted from the hydci_pp subroutine into a separate file containing a new subroutine: accComp.f90.
[Code-structure diagram: FILE mo_gscp_dwd.f90 keeps MODULE mo_gscp_dwd with Subr. HYDCI_PP_INIT and Subr. HYDCI_PP; the new FILE accComp.f90 holds Subr. accComp, which contains the part accelerated via F2C-ACC directives.]
Modified Code Structure: why?
The major limitations that drove the changes in the code are:

Modules are (for now) not supported → the necessary variables are passed explicitly to the called subroutines, and the called subroutines/functions are included in the same file.

The F2C-ACC "--kernel" option is not thoroughly tested → elemental functions and the subroutine "satad" have been inlined (see the accComp sketch below).
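
A minimal sketch of what accComp.f90 looks like under these constraints; the argument list, array names and the loop body are invented for illustration and are not the real interface (the ACC$ directive spellings, again, should be checked against the installed F2C-ACC release).

  subroutine accComp(ni, nk, zdt, t, qv, qc)
    ! Sketch only: every field formerly taken from MODULE mo_gscp_dwd is
    ! passed explicitly, and the work of the elemental functions and of
    ! subroutine satad is inlined in the loop body.
    implicit none
    integer, intent(in)    :: ni, nk
    real,    intent(in)    :: zdt
    real,    intent(inout) :: t(ni,nk), qv(ni,nk), qc(ni,nk)
    integer :: i, k

  !ACC$REGION(<nk>,<ni>) BEGIN
  !ACC$DO PARALLEL(1)
    do i = 1, ni
  !ACC$DO VECTOR(1)
       do k = 1, nk
          ! ... microphysical computing (formerly inside hydci_pp) ...
          ! ... inlined saturation adjustment (formerly subroutine satad) ...
          ! placeholder updates touching the passed fields:
          qc(i,k) = max(0.0, qc(i,k) - zdt * qv(i,k))
          t(i,k)  = t(i,k) + zdt * qv(i,k)
       end do
    end do
  !ACC$REGION END
  end subroutine accComp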
Modified Code Structure
Host / Device View
[Host/device diagram: on the CPU side, MODULE mo_gscp_dwd with Subr. HYDCI_PP_INIT and Subr. HYDCI_PP; HYDCI_PP copies the needed fields to the GPU (CopyIn), calls Subr. accComp (the part accelerated via F2C-ACC directives), which runs on the GPU, and copies the results back (CopyOut). A host-side sketch follows.]
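
From the host side, HYDCI_PP now simply wraps the call; the sketch below (with the same invented names as the accComp sketch above) marks where the CUDA code generated by F2C-ACC performs the copies.

  subroutine hydci_pp_sketch(ni, nk, zdt, t, qv, qc)
    ! Illustrative only: the CUDA generated for accComp copies the input
    ! fields host-to-device before the kernel launch (CopyIn) and copies
    ! the updated fields back afterwards (CopyOut); the Fortran caller
    ! only sees an ordinary subroutine call.
    implicit none
    integer, intent(in)    :: ni, nk
    real,    intent(in)    :: zdt
    real,    intent(inout) :: t(ni,nk), qv(ni,nk), qc(ni,nk)

    ! ... presettings on the CPU (unchanged part of HYDCI_PP) ...

    call accComp(ni, nk, zdt, t, qv, qc)   ! CopyIn -> GPU kernel -> CopyOut

    ! ... remaining CPU work of HYDCI_PP on the copied-back fields ...
  end subroutine hydci_pp_sketch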
Results
Timesteps      CPU       GPU   F2C-ACC
      250   42,685     5,952     4,814
      500   84,951    11,819     9,634
      750  125,973    18,389    16,650
     1000  166,206    26,081    21,843
Results


The check.dat file produced by the run of the model developed with F2C-ACC compares well with the check.dat file produced with the PGI version.

In particular, we can see the comparison for one iteration between the F2C-ACC version and the Fortran version:
Comparing files …
 #  field     nt  nd  n_err  mean R_er  max R_er  max A_er  ( i,  j,  k)
 1  t          1   3   8430    2.2E-16   6.2E-16   1.7E-13  ( 16, 58, 42)
 8  tinc_lh    1   3   5681    2.6E-03   1.1E-01   1.1E-13  ( 13, 53, 47)
Conclusions

 First results are encouraging: the performance of the F2C-ACC Microphysics is quite good.

 F2C-ACC pros:
• Directive-based (incremental parallelization): readable, and only one source code to maintain
• "Adjustable" CUDA code is generated: portability and efficiency

 F2C-ACC cons:
• Ongoing project: it is an «application specific Fortran-to-CUDA compiler for performance evaluation», with limited support, for now, for some advanced Fortran features (e.g. modules)
• Check for correctness: intrinsics (e.g. reduction) and advanced Fermi features (e.g. FMA support) are not handled «automatically» by the F2C-ACC compiler