High Productivity Computing Systems Program

Towards Optimized UPC Implementations
Tarek A. El-Ghazawi
The George Washington University
[email protected]
IBM T.J. Watson Research Center, 02/22/05
Agenda
 Background
 UPC Language Overview
 Productivity
 Performance Issues
 Automatic Optimizations
 Conclusions
Parallel Programming Models
 What is a programming model?
An abstract machine that defines the programmer's view of data and execution
 Where architecture and applications meet
A non-binding contract between the programmer
and the compiler/system
 Good Programming Models Should
Allow efficient mapping on different architectures
Keep programming easy
 Benefits
Application - independence from architecture
Architecture - independence from applications
Programming Models
[Diagram: process/thread and address-space organization under the three models: Message Passing (e.g. MPI), Shared Memory (e.g. OpenMP), and DSM/PGAS (e.g. UPC).]
Programming Paradigms Expressivity
Paradigms classified by whether PARALLELISM and LOCALITY are implicit or explicit:

                          LOCALITY: Implicit          LOCALITY: Explicit
PARALLELISM: Implicit     Sequential                  Data Parallel
                          (e.g. C, Fortran, Java)     (e.g. HPF, C*)
PARALLELISM: Explicit     Shared Memory               Distributed Shared Memory/PGAS
                          (e.g. OpenMP)               (e.g. UPC, CAF, and Titanium)
What is UPC?
 Unified Parallel C
 An explicit parallel extension of ISO C
 A distributed shared memory/PGAS parallel
programming language
Why not message passing?
 Performance
High penalty for short transactions
Cost of calls
Two sided
Excessive buffering
 Ease-of-use
Explicit data transfers
Domain decomposition does not maintain the
original global application view
More code and conceptual difficulty
Why DSM/PGAS?
 Performance
No calls
Efficient short transfers
Locality
 Ease-of-use
Implicit transfers
Consistent global application view
Less code and conceptual difficulty
Why DSM/PGAS:
New Opportunities for Compiler Optimizations
[Figure: an image processed with the Sobel operator, partitioned by rows across Thread0-Thread3, with ghost zones at the partition boundaries.]
 The DSM programming model exposes sequential remote accesses at compile time
Opportunity for compiler-directed prefetching, as sketched below
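As a rough sketch of the kind of transfer such a compiler could generate (the image array, its dimensions, and the ghost buffers are illustrative, not part of the original code; upc_memget is the standard UPC bulk-transfer routine), a thread can pull its neighbors' boundary rows into private ghost buffers in one transfer each before applying the operator:

#include <upc.h>

#define N 2048
/* one image row per block: row r has affinity to thread r % THREADS */
shared [N] unsigned char img[N][N];

/* private ghost copies of the rows just above and below row r */
unsigned char ghost_above[N], ghost_below[N];

void prefetch_ghost_rows(int r)
{
    /* one bulk get per neighboring row instead of N element-wise shared reads */
    if (r > 0)
        upc_memget(ghost_above, &img[r - 1][0], N);
    if (r < N - 1)
        upc_memget(ghost_below, &img[r + 1][0], N);
}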
History
 Initial Tech. Report from IDA in collaboration with
LLNL and UCB in May 1999
 UPC consortium of government, academia, and
HPC vendors coordinated by GWU, IDA, and DoD
 The participants currently are: IDA CCS, GWU, UCB, MTU, UMN, ARSC, UMCP, U. Florida, ANL, LBNL, LLNL, DoD, DoE, HP, Cray, IBM, Sun, Intrepid, Etnus, …
Status
 Specification v1.0 completed February of 2001, v1.1.1 in
October of 2003, v1.2 will add collectives and UPC/IO
 Benchmarking Suites: Stream, GUPS, RandomAccess, NPB
suite, Splash-2, and others
 Testing suite v1.0, v1.1
 Short courses and tutorials in the US and abroad
 Research Exhibits at SC 2000-2004
 UPC web site: upc.gwu.edu
 UPC Book by mid 2005 from John Wiley and Sons
 Manual(s)
Hardware Platforms
 UPC implementations are available for
SGI O 2000/3000 – 32 and 64b GCC
 Intrepid
 UCB – 32b GCC
Cray T3D/E
Cray X-1
HP AlphaServer SC, Superdome
UPC Berkeley Compiler: Myrinet, Quadrics, and Infiniband Clusters
Beowulf Reference Implementation (MPI-based, MTU)
New ongoing efforts by IBM and Sun
UPC Execution Model
 A number of threads working independently in a
SPMD fashion
MYTHREAD specifies thread index
(0..THREADS-1)
Number of threads specified at compile-time or
run-time
 Process and Data Synchronization when needed
Barriers and split phase barriers
Locks and arrays of locks
Fence
Memory consistency control
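As a minimal sketch of this SPMD model (the printed messages are illustrative only):

#include <upc.h>
#include <stdio.h>

int main(void)
{
    /* every thread executes the same program */
    printf("hello from thread %d of %d\n", MYTHREAD, THREADS);

    upc_barrier;          /* all threads synchronize here */

    if (MYTHREAD == 0)    /* thread 0 does the serial part */
        printf("all %d threads passed the barrier\n", THREADS);
    return 0;
}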
UPC Memory Model
[Figure: the UPC memory view: a shared space partitioned with affinity among Thread 0 … Thread THREADS-1, above per-thread private spaces Private 0 … Private THREADS-1.]
 Shared space with thread affinity, plus private spaces
 A pointer-to-shared can reference all locations in the shared space
 A private pointer may reference only addresses in its
private space or addresses in its portion of the shared space
 Static and dynamic memory allocations are supported for both
shared and private memory
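The declarations below give a small, hedged illustration of these spaces (the array names and sizes are made up for the example):

#include <upc.h>

#define N 16
shared [4] int a[N];   /* shared space: blocks of 4 ints, distributed round-robin   */
shared int x;          /* shared scalar; it has affinity to thread 0                */
int counter;           /* private space: every thread owns an independent copy      */

void spaces_demo(void)
{
    shared int *p = &a[0];     /* a pointer-to-shared may reference any shared location */
    counter = MYTHREAD;        /* private access, never involves communication          */

    if (MYTHREAD == 0) {
        int *q = (int *)&x;    /* casting to a private pointer is valid only where the  */
        *q = 1;                /* target has affinity to the calling thread (thread 0)  */
    }
    if (MYTHREAD == THREADS - 1)
        *p = 2;                /* possibly remote write through the pointer-to-shared   */
}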
UPC Pointers
 How to declare them?
int *p1;                /* private pointer pointing locally                */
shared int *p2;         /* private pointer pointing into the shared space */
int *shared p3;         /* shared pointer pointing locally                */
shared int *shared p4;  /* shared pointer pointing into the shared space  */
 You may find many using “shared pointer” to mean a pointer pointing to a shared object, e.g. equivalent to p2, though it could be p4 as well.
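UPC also provides library calls to examine the components of a pointer-to-shared, which helps when reasoning about affinity; a small hedged sketch (the array is illustrative):

#include <upc.h>
#include <stdio.h>

shared int v[100];

void where_is(int i)
{
    /* standard UPC pointer-to-shared component functions */
    printf("v[%d] -> thread %lu, phase %lu\n", i,
           (unsigned long)upc_threadof(&v[i]),
           (unsigned long)upc_phaseof(&v[i]));
}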
UPC Pointers
[Figure: where the four pointers live: p3 and p4 reside in the shared space (with affinity to thread 0), while every thread holds its own copies of the private pointers p1 and p2; p1 points into the private space, p2 and p4 point into the shared space.]
Synchronization - Barriers
 No implicit synchronization among the
threads
 UPC provides the following synchronization
mechanisms:
Barriers
Locks
Memory Consistency Control
Fence
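A small, hedged sketch of these mechanisms used together (the shared counter is illustrative, and lock allocation would normally be done once rather than per call):

#include <upc.h>

shared int total;                      /* illustrative shared counter */

void sync_demo(int mycount)
{
    /* collective call: every thread receives a pointer to the same lock */
    upc_lock_t *lk = upc_all_lock_alloc();

    upc_lock(lk);                      /* mutual exclusion around the shared update    */
    total += mycount;
    upc_unlock(lk);

    upc_barrier;                       /* blocking barrier: all threads meet here      */

    upc_notify;                        /* split-phase barrier: signal arrival ...      */
    /* ... independent local work can be overlapped here ...                           */
    upc_wait;                          /* ... then wait until everyone has notified    */

    upc_fence;                         /* locally complete all earlier shared accesses */
}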
Memory Consistency Models
 Has to do with ordering of shared operations, and
when a change of a shared object by a thread becomes
visible to others
 Consistency can be strict or relaxed
 Under the relaxed consistency model, the shared
operations can be reordered by the compiler / runtime
system
 The strict consistency model enforces sequential
ordering of shared operations. (No operation on
shared can begin before the previous ones are done,
and changes become visible immediately)
Memory Consistency Models
 User specifies the memory model through:
declarations
pragmas for a particular statement or sequence of
statements
use of barriers, and global operations
 Programmers responsible for using correct
consistency model
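A hedged sketch of these mechanisms (the flag/data names are illustrative; the strict and relaxed qualifiers and the upc strict/relaxed pragmas are the standard UPC spellings):

#include <upc_relaxed.h>     /* file-wide default: relaxed consistency                  */

strict shared int flag;      /* per-object override: accesses to flag are strict        */
relaxed shared int data;     /* relaxed accesses may be reordered by compiler/runtime   */

void producer(void)
{
    data = 42;               /* relaxed write                                           */
    flag = 1;                /* strict write: not reordered with other shared accesses  */
}                            /* issued by this thread                                   */

void consumer(void)
{
#pragma upc strict           /* statement-level override for the rest of this block     */
    while (flag == 0)
        ;                    /* poll until the producer signals                         */
    /* data can now be read safely */
}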
UPC and Productivity
 Metrics
Lines of ‘useful’ Code
 indicates the development time as well as the maintenance cost
Number of ‘useful’ Characters
 alternative way to measure development and maintenance efforts
Conceptual Complexity
 function level, keyword usage, number of tokens, max loop depth, …
Manual Effort – NPB Example

          SEQ              UPC              SEQ              MPI
          #line   #char    #line   #char    #line   #char    #line   #char
NPB-CG    665     16145    710     17200    506     16485    1046    37501
NPB-EP    127     2868     183     4117     130     4741     181     6567
NPB-FT    575     13090    1018    21672    665     22188    1278    44348
NPB-IS    353     7273     528     13114    353     7273     627     13324
NPB-MG    610     14830    866     21990    885     27129    1613    50497

          UPC Effort (%)   MPI Effort (%)
          #line   #char    #line   #char
NPB-CG    6.77    6.53     106.72  127.49
NPB-EP    44.09   43.55    36.23   38.52
NPB-FT    77.04   65.56    92.18   99.87
NPB-IS    49.58   80.31    77.62   83.20
NPB-MG    41.97   48.28    82.26   86.14

UPC effort = (#UPC − #SEQ) / #SEQ        MPI effort = (#MPI − #SEQ) / #SEQ
Manual Effort – More Examples

            SEQ              MPI              SEQ              UPC
            #line   #char    #line   #char    #line   #char    #line   #char
GUPS        41      1063     98      2979     41      1063     47      1251
Histogram   12      188      30      705      12      188      20      376
N-Queens    86      1555     166     3332     86      1555     139     2516

            MPI Effort (%)   UPC Effort (%)
            #line   #char    #line   #char
GUPS        139.02  180.02   14.63   17.68
Histogram   150.00  275.00   66.67   100.00
N-Queens    93.02   124.28   61.63   61.80

UPC effort = (#UPC − #SEQ) / #SEQ        MPI effort = (#MPI − #SEQ) / #SEQ
Conceptual Complexity - HIST
[Table: conceptual complexity of the UPC and MPI Histogramming codes, counting parameters, function calls, references to THREADS/MYTHREAD (UPC) or nprocs/myrank (MPI), language constructs and types, and notable operations, broken down by work distribution, data distribution, communication, synchronization & consistency, and miscellaneous operations, with per-category sums and an overall score. The UPC version needs only shared declarations, one lock (with lock/unlock) and two barriers; the MPI version needs Scatter and Reduce collectives, Init/Finalize and explicit communication, and reaches an overall score of 47.]
Conceptual Complexity - GUPS
[Table: the same conceptual-complexity breakdown for GUPS. The UPC version uses upc_forall, shared declarations, and all_alloc/free; the MPI version needs window creation, one-sided operations, collectives (implicit with the collective calls and WinFence), memory allocation/free, Init/Finalize, comm_rank/comm_size, timing and error handling, plus barriers, and reaches an overall score of 136.]
UPC Optimizations Issues
 Particular Challenges
Avoiding Address Translation
Cost of Address Translation
 Special Opportunities
Locality-driven compiler-directed prefetching
Aggregation
 General
Low-level optimized libraries, e.g. collectives
Backend optimizations
Overlapping of remote accesses and
synchronization with other work
Showing Potential Optimizations Through Emulated Hand-Tunings
 Different hand-tuning levels:
 Unoptimized UPC code
 referred to as UPC.O0
 Privatized UPC code
 referred to as UPC.O1
 Prefetched UPC code
 hand-optimized variant using block get/put to mimic the effect of prefetching
 referred to as UPC.O2
 Fully hand-tuned UPC code
 hand-optimized variant integrating privatization, aggregation of remote accesses, as well as prefetching
 referred to as UPC.O3
(a sketch of the O0/O1/O2 idioms follows the reference below)
 T. El-Ghazawi and S. Chauvin, “UPC Benchmarking Issues”, Proceedings of the 30th IEEE International Conference on Parallel Processing (ICPP 2001), pp. 365-372
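To make the first three levels concrete, here is a hedged sketch of what the hand tunings look like at the source level (the array name, block size, and the assumptions that N divides evenly over THREADS and that a static THREADS environment is used, as in the slides' own examples, are all illustrative):

#include <upc.h>

#define B 256                       /* illustrative block size                          */
#define N (B * 64)                  /* illustrative size; assume N/B divides by THREADS */
shared [B] int a[N];                /* whole blocks of B ints per thread                */

/* UPC.O0 - unoptimized: every reference is a translated shared access */
long sum_O0(void)
{
    long s = 0;
    int i;
    upc_forall (i = 0; i < N; i++; &a[i])   /* only iterations with local affinity      */
        s += a[i];                          /* but each access still pays translation   */
    return s;
}

/* UPC.O1 - privatization: local shared data accessed through a private pointer */
long sum_O1(void)
{
    long s = 0;
    int *la = (int *)&a[MYTHREAD * B];      /* first block owned by this thread;        */
    int i, nlocal = N / THREADS;            /* a thread's blocks are contiguous locally */
    for (i = 0; i < nlocal; i++)
        s += la[i];                         /* plain C accesses, no translation         */
    return s;
}

/* UPC.O2 - prefetching emulated by one bulk get of a (possibly remote) block */
long sum_block_O2(int b)
{
    int buf[B];
    long s = 0;
    int i;
    upc_memget(buf, &a[b * B], B * sizeof(int));   /* one transfer instead of B reads   */
    for (i = 0; i < B; i++)
        s += buf[i];
    return s;
}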
STREAM BENCHMARK
Address Translation Cost and Local Space Privatization – Cluster

MB/s           Put        Get        Scale      Sum
CC             N/A        N/A        1565.04    5409.3
UPC Private    N/A        N/A        1687.63    1776.81
UPC Local      1196.51    1082.89    54.22      82.7
UPC Remote     241.43     237.51     0.09       0.16

MB/s           Copy (arr)  Copy (ptr)  Memcpy     Memset
CC             1340.99     1488.02     1223.86    2401.26
UPC Private    1383.57     433.45      1252.47    2352.71
UPC Local      47.2        90.67       1202.8     2398.9
UPC Remote     0.09        0.20        1197.22    2360.59
Results gathered on a Myrinet Cluster
Address Translation and Local Space Privatization – DSM Architecture
STREAM BENCHMARK (MB/s)

                                      Bulk operations                     Element-by-element operations
MB/s                                  Memory copy  Block Get  Block Put   Array Set  Array Copy  Sum   Scale
GCC                                   127          N/A        N/A         175        106         223   108
UPC Private                           127          N/A        N/A         173        106         215   107
UPC Local Shared                      139          140        136         26         14          31    13
UPC Remote Shared (within SMP node)   130          129        136         26         13          30    13
UPC Remote Shared (beyond SMP node)   112          117        136         24         12          28    12
Aggregation and Overlapping of Remote
Shared Memory Accesses
[Figure: execution time (sec) vs. number of threads for UPC N-Queens and UPC Sobel Edge, each comparing UPC NO OPT. against UPC FULL OPT.]
 The benefit of hand-optimizations is greatly application dependent:
 N-Queens does not perform any better, mainly because it is an embarrassingly parallel program
 The Sobel Edge Detector gets a speedup of one order of magnitude after hand-optimizing, and scales almost perfectly linearly
 Results on the SGI O2000
Impact of Hand-Optimizations on NPB.CG
[Figure: computation time (sec) of NPB CG, Class A, on the SGI Origin 2000 for 1-32 processors, comparing UPC-O0, UPC-O1, UPC-O3, and GCC.]
Shared Address Translation
Overhead
[Figure: a private memory access involves only the actual access, whereas a local shared memory access adds an address calculation and a UPC put/get function call. Measured breakdown for a local shared access (SGI Origin 2000, GCC-UPC): memory access about 144 ns, address calculation about 247 ns, address function call about 123 ns.]
Quantification of the Address Translation Overheads
 Address translation overhead is quite significant
More than 70% of work for a local-shared memory access
 Demonstrates the real need for optimization
Shared Address Translation Overheads for
Sobel Edge Detection
[Figure: execution time (sec) of Sobel Edge Detection for 1-16 processors, comparing UPC.O0 and UPC.O3; each bar is split into processing + memory access, address calculation, and address function call.]
UPC.O0: unoptimized UPC code; UPC.O3: hand-optimized UPC code. Ox notation from T. El-Ghazawi and S. Chauvin, “UPC Benchmarking Issues”, Proceedings of the 2001 International Conference on Parallel Processing, Valencia, September 2001
Reducing Address Translation Overheads via
Translation Look-Aside Buffers
 F. Cantonnet, T. El-Ghazawi, P. Lorenz, J. Gaber, “Fast
Address Translation Techniques for Distributed Shared
Memory Compilers”, IPDPS’05, Denver CO, April 2005
 Use look-up Memory Model Translation Buffers (MMTBs) to perform fast translations
 Two alternative methods are proposed to create and use MMTBs:
FT: basic method, using direct addressing
RT: advanced method, using indexed addressing
 Was prototyped as a compiler-enabled optimization
no modifications to actual UPC codes are needed
Different Strategies – Full-Table

Consider shared int array[8]; distributed across 4 THREADS (default blocksize, so array[i] has affinity to thread i mod 4), with a copy of the MMTB stored on each thread.

[Figure: each of the 8 elements maps to an FT entry holding its virtual address (e.g. FT[0] = 57FF8040, FT[1] = 5FFF8040, …), and the full look-up table is replicated on every thread.]

To initialize FT:
   ∀ i ∈ [0,7], FT[i] = _get_vaddr(&array[i])
To access array[]:
   ∀ i ∈ [0,7], array[i] = _get_value_at(FT[i])

Pros
 Direct mapping
 No address calculation
Cons
 Large memory required
 Can lead to competition over caches and main memory
Different Strategies – Reduced-Table: Infinite blocksize

RT Strategy, BLOCKSIZE = infinite: only the address of the first element of the array needs to be saved, since all of the array data is contiguous.

Consider shared [] int array[4];

To initialize RT:
   RT[0] = _get_vaddr(&array[0])
To access array[]:
   ∀ i ∈ [0,3], array[i] = _get_value_at(RT[0] + i)

 Only one table entry in this case
 The address calculation step is simple in this case

[Figure: array[0..3] resides contiguously on THREAD0, and each thread's RT holds the single entry RT[0] pointing to it.]
Different Strategies – Reduced-Table: Default blocksize

RT Strategy, BLOCKSIZE = 1: only the address of the first element on each thread needs to be saved, since each thread's portion of the array data is contiguous.

Consider shared [1] int array[16];

To initialize RT:
   ∀ i ∈ [0,THREADS-1], RT[i] = _get_vaddr(&array[i])
To access array[]:
   ∀ i ∈ [0,15], array[i] = _get_value_at(RT[i mod THREADS] + (i / THREADS))

 Less memory required than FT; the MMTB buffer has THREADS entries
 The address calculation step is a bit costly, but much cheaper than in current implementations

[Figure: with 4 threads, THREAD0 holds array[0], array[4], array[8], array[12] contiguously and RT[0] points to array[0]; RT[1]-RT[3] likewise point to the first element on THREAD1-THREAD3, and the 4-entry RT is replicated on every thread.]
Different Strategies – Reduced-Table: Arbitrary blocksize

RT Strategy, arbitrary block sizes: only the address of the first element of each block needs to be saved, since all data within a block is contiguous.

Consider shared [2] int array[16];

To initialize RT:
   ∀ i ∈ [0,7], RT[i] = _get_vaddr(&array[i*blocksize(array)])
To access array[]:
   ∀ i ∈ [0,15], array[i] = _get_value_at(RT[i / blocksize(array)] + (i mod blocksize(array)))

 Less memory required than for FT, but more than in the previous cases
 The address calculation step is more costly than in the previous cases

[Figure: with 4 threads and blocksize 2, THREAD0 holds blocks {array[0],array[1]} and {array[8],array[9]}, whose base addresses are kept in RT[0] and RT[4]; the 8-entry RT is replicated on every thread.]
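A hedged sketch, in plain C, of the arbitrary-blocksize RT arithmetic above (the table, the blocksize, and the helper below are illustrative; on a DSM machine an MMTB entry can simply hold a block's base virtual address):

#define B        2                   /* blocksize of the shared array (as in the example) */
#define NBLOCKS  8                   /* number of blocks for shared [2] int array[16]     */

static int *RT[NBLOCKS];             /* reduced table: one base address per block         */

/* illustrative helper: record the base address of block b once, at initialization */
void rt_set_entry(int b, int *block_base)
{
    RT[b] = block_base;
}

/* translate logical index i with one table lookup plus a divide and a modulo */
int rt_load(int i)
{
    return RT[i / B][i % B];         /* base of the block + offset within the block       */
}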
Performance Impact of the MMTB
– Sobel Edge
[Figure: execution time (sec) of Sobel Edge (N=2048) for 1-16 threads, comparing O0, O0.FT, O0.RT, O3 and MPI.]
Performance of Sobel-Edge Detection using the new MMTB strategies (with and without O0)
 FT and RT perform around 6 to 8 times better than the regular basic UPC version (O0)
 The RT strategy is slower than FT since the address calculation (arbitrary block size case) becomes more complex
 FT, on the other hand, performs almost as well as the hand-tuned versions (O3 and MPI)
Performance Impact of the MMTB
– Matrix Multiplication
[Figure: execution time (sec) of Matrix Multiplication (N=256) for 1-16 threads, comparing UPC.O0, UPC.O0.FT, UPC.O0.RT, UPC.O3 and MPI, together with a hardware profile at 16 threads (computation, L1 and L2 data cache misses, graduated loads, graduated stores, decoded branches, TLB misses). Caption: Performance and Hardware Profiling of Matrix Multiplication using the new MMTB strategies.]
 FT strategy: an increase in L1 data cache misses due to the large table size
 RT strategy: L1 misses are kept low, but an increase in the number of loads and stores is observed, showing an increase in computation (arbitrary blocksize used)
Time and storage requirements of
the Address Translation Methods for the
Matrix Multiply Microkernel
For a shared array of N elements with B as blocksize (E: element size in bytes, P: pointer size in bytes):

              Storage requirements          # of memory accesses per     # of arithmetic operations per
              per shared array              shared memory access         shared memory access
UPC.O0        N × E                         more than 25                 more than 5
UPC.O0.FT     N × E + N × P × THREADS       1                            0
UPC.O0.RT     N × E + (N/B) × P × THREADS   1                            up to 3

Comparison among the optimizations of storage, memory-access and computation requirements
 The number of loads and stores can increase with the arithmetic operators
UPC Work-sharing Construct Optimizations

By thread/index number (for integer):
for(i=0; i<N; i++)
{
    if(MYTHREAD == i%THREADS)
        loop body;
}

By thread/index number (upc_forall integer):
upc_forall(i=0; i<N; i++; i)
    loop body;

By the address of a shared variable (upc_forall address):
upc_forall(i=0; i<N; i++; &shared_var[i])
    loop body;

By the address of a shared variable (for address):
for(i=0; i<N; i++)
{
    if(upc_threadof(&shared_var[i]) == MYTHREAD)
        loop body;
}

By thread/index number (for optimized):
for(i=MYTHREAD; i<N; i+=THREADS)
    loop body;
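As a complete, hedged example of the address-affinity form above (the array names and size are illustrative), a vector add might look like:

#include <upc.h>

#define N 1024
shared int a[N], b[N], c[N];   /* default blocksize: element i lives on thread i % THREADS */

void vadd(void)
{
    int i;
    /* iteration i runs on the thread that has affinity to &c[i],
       so every element is updated by its owner with no remote writes */
    upc_forall (i = 0; i < N; i++; &c[i])
        c[i] = a[i] + b[i];
}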
Performance of Equivalent upc_forall and for Loops
[Figure: time (sec) on 1-16 processors for the five equivalent loops: upc_forall address, upc_forall integer, for address, for integer, and for optimized.]
Performance Limitations Imposed by
Sequential C Compilers -- STREAM
                 BULK                           Element-by-Element
NUMA (MB/s)      memcpy   memset   Struct cp    Copy (arr)  Copy (ptr)  Set      Sum      Scale   Add     Triad
F                291.21   163.90   N/A          291.59      N/A         159.68   135.37   246.3   235.1   303.82
C                231.20   214.62   158.86       120.57      152.77      147.70   298.38   133.4   13.86   20.71

                 BULK                           Element-by-Element
Vector (MB/s)    memcpy   memset   Struct cp    Copy (arr)  Copy (ptr)  Set      Sum      Scale   Add     Triad
F                14423    11051    N/A          14407       N/A         11015    17837    14423   10715   16053
C                18850    5307     7882         7972        7969        10576    18260    7865    3874    5824
Loopmark – SET/ADD Operations
(Vector platform STREAM results in MB/s repeated from the previous slide.)
Let us compare loopmarks for each F / C operation
Loopmark – SET/ADD Operations
MEMSET (bulk set)
Fortran:
  146. 1              t = mysecond(tflag)
  147. 1 V M--<><>    a(1:n) = 1.0d0
  148. 1              t = mysecond(tflag) - t
  149. 1              times(2,k) = t
C:
  163. 1              times[1][k] = mysecond_();
  164. 1              memset(a, 1, NDIM*sizeof(elem_t));
  165. 1              times[1][k] = mysecond_() - times[1][k];

SET
Fortran:
  158. 1              arrsum = 2.0d0
  159. 1              t = mysecond(tflag)
  160. 1 MV------<    DO i = 1,n
  161. 1 MV             c(i) = arrsum
  162. 1 MV             arrsum = arrsum + 1
  163. 1 MV------>    END DO
  164. 1              t = mysecond(tflag) - t
  165. 1              times(4,k) = t
C:
  217. 1              set = 2;
  220. 1              times[5][k] = mysecond_();
  222. 1 MV--<        for (i=0; i<NDIM; i++)
  223. 1 MV           {
  224. 1 MV             c[i] = (set++);
  225. 1 MV-->        }
  227. 1              times[5][k] = mysecond_() - times[5][k];

ADD
Fortran:
  180. 1              t = mysecond(tflag)
  181. 1 V M--<><>    c(1:n) = a(1:n) + b(1:n)
  182. 1              t = mysecond(tflag) - t
  183. 1              times(7,k) = t
C:
  283. 1              times[10][k] = mysecond_();
  285. 1 Vp--<        for (j=0; j<NDIM; j++)
  286. 1 Vp           {
  287. 1 Vp             c[j] = a[j] + b[j];
  288. 1 Vp-->        }
  290. 1              times[10][k] = mysecond_() - times[10][k];
Legend: V: Vectorized – M: Multistreamed – p: conditional, partial and/or computed
UPC vs CAF using the NPB workloads
 In general, UPC is slower than CAF, mainly due to
Point-to-point vs. barrier synchronization
 Better scalability with proper collective operations
 Program writers can do a point-to-point synchronization using current constructs
Scalar performance of source-to-source translated code
 Alias analysis (C pointers)
» Can highlight the need for explicitly using restrict to help several compiler backends (see the sketch below)
 Lack of support for multi-dimensional arrays in C
» Can prevent high-level loop transformations and software pipelining, causing a 2x slowdown in SP for UPC
 Need for exhaustive C compiler analysis
» A failure to perform proper loop fusion and alignment in the critical section of MG can lead to 51% more loads for UPC than CAF
» A failure to adequately unroll the sparse matrix-vector multiplication in CG can lead to more cycles in UPC
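To illustrate the alias-analysis point, a small hedged sketch (the function and array names are illustrative): restrict-qualified pointers promise the C backend that the operands do not overlap, which lets it vectorize and software-pipeline the loop.

/* Without restrict, the C backend must assume c, a and b may alias and often
   holds back vectorization and software pipelining; with restrict it need not. */
void triad(double * restrict c, const double * restrict a,
           const double * restrict b, double s, int n)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] + s * b[i];
}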
Conclusions
 UPC is a locality-aware parallel programming
language
 With proper optimizations, UPC can outperform MPI in random short accesses and can otherwise perform as well as MPI
 UPC is very productive and UPC applications result
in much smaller and more readable code than MPI
 UPC compiler optimizations are still lagging, in
spite of the fact that substantial progress has been
made
 For future architectures, UPC has the unique opportunity of having very efficient implementations, as most of the pitfalls and obstacles have been revealed along with adequate solutions
Conclusions
 In general, four types of optimizations:
Optimizations to Exploit the Locality
Consciousness and other Unique Features of
UPC
Optimizations to Keep the Overhead of UPC low
Optimizations to Exploit Architectural Features
Standard Optimizations that are Applicable to all
Systems Compilers
Conclusions
 Optimizations possible at three levels:
A source-to-source translator acting during the compilation phase and incorporating most UPC-specific optimizations
C backend compilers that can compete with Fortran
A strong run-time system that can work effectively with the operating system
Selected Publications
 T. El-Ghazawi, W. Carlson, T. Sterling, and K. Yelick, UPC: Distributed Shared Memory Programming. John Wiley & Sons, Inc., New York, 2005. ISBN: 0-471-22048-5. (June 2005)
 T. El-Ghazawi, F. Cantonnet, Y. Yao, S. Annareddy, A. Mohamed, Benchmarking Parallel Compilers for Distributed Shared Memory Languages: A UPC Case Study, Journal of Future Generation Computer Systems, North-Holland (accepted)
Selected Publications
 T. El-Ghazawi and S. Chauvin, “UPC Benchmarking Issues”, Proceedings of the 30th IEEE International Conference on Parallel Processing (ICPP 2001), pp. 365-372
 T. El-Ghazawi and F. Cantonnet. “UPC performance
and potential: A NPB experimental study”.
Supercomputing 2002 (SC2002), Baltimore, November
2002
 F. Cantonnet, T. El-Ghazawi, P. Lorenz, J. Gaber, “Fast
Address Translation Techniques for Distributed Shared
Memory Compilers”, IPDPS’05, Denver CO, April 2005
 CUG and PPOP