CS 267: Applications of Parallel Computers
Lecture 10:
Unified Parallel C (UPC)
Kathy Yelick
http://www-inst.eecs.berkeley.edu/~cs267
Based on slides by Tarek El-Ghazawi (GMU),
Bill Carlson (IDA-CCS), Greg Fischer (Cray Inc.)
Administrivia
• Lecture schedule:
• 10/1: UPC
• 10/3: Titanium (Dan Bonachea)
• 10/8: Computational Biology (Teresa Head-Gordon)
• Includes project idea
• 10/10: The TAU Performance Tool (Sameer Shende)
• 10/15: Dense Matrix Products
• 10/17: Sparse Matrix Products
• 10/22: Dense Matrix Solvers (Jim Demmel)
• 10/24: Sparse Direct Solvers (Xiaoye Li)
Outline
• UPC Motivation
• Programming in UPC
• Parallelism model
• Communication
• Synchronization
• Early Experience
• UPC History and Status
UPC Motivation
The Message Passing Model
• Positive:
  • Programmers control data and work distribution.
• Negative:
  • Significant communication overhead for small transactions
  • Not easy to use
• Example: MPI
[Figure: separate processes, each with its own address space, connected by a network]
The Shared Memory Model
• Positive:
  • Simple statements
    • read remote memory via an expression
    • write remote memory through assignment
• Negative:
  • Manipulating shared data may require synchronization
  • Does not allow locality exploitation
• Examples: Threads, OpenMP
[Figure: multiple threads accessing a shared variable x in a single shared address space]
The Global Address Space Model
• Similar to the shared memory model in semantics
• Memory Mi has affinity to thread Thi
• Positive:
  • Helps exploit locality of references
  • Simple statements, as in shared memory
• Negative:
  • Explicit control over layout
  • Consistency is complicated
• Examples: UPC, Titanium, Co-Array Fortran
[Figure: threads Th0..Th4 with memories M0..M4 forming one partitioned shared address space that holds a shared variable x]
Programming in UPC
Parallel Programming Overview
Basic parallel programming problems:
1. Creating parallelism
• SPMD Model
2. Communication between processors
• Private vs. Shared variables
• Shared arrays
• Consistency models
3. Synchronization
• Point-to-point synchronization
• Global synchronization
UPC Memory View
[Figure: global address space with threads 0..THREADS-1; each thread has a private space (Private 0 .. Private THREADS-1) plus a portion of the shared space]
• A shared pointer can reference all locations in the shared space
• A private pointer may reference only addresses in its private space or addresses in its portion of the shared space
• Static and dynamic memory allocations are supported for both shared and private memory
Parallelism Model
• A set of THREADS threads working independently
• Two compilation models:
  • THREADS may be fixed at compile time, or
  • dynamically set at program startup time
• MYTHREAD specifies the thread index (0..THREADS-1)
• Simple synchronization mechanisms (barriers, locks, ...); see the sketch below
  • Written: upc_barrier;
  • In a previous spec (and some implementations) it was: upc_barrier();
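A minimal hello-world sketch of this SPMD model and the barrier statement (the file name hello.upc is hypothetical, not from the slides):

/* hello.upc -- minimal sketch of the SPMD execution and barrier use above */
#include <upc_relaxed.h>
#include <stdio.h>

int main(void) {
    /* Every thread runs the same program; MYTHREAD distinguishes them. */
    printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);

    upc_barrier;   /* all threads synchronize here */

    if (MYTHREAD == 0)
        printf("All %d threads passed the barrier\n", THREADS);
    return 0;
}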
Shared and Private Variables
• A shared variable has one instance, shared by all threads.
• Affinity to thread 0 by default (allocated in processor 0’s memory)
• A private variable has an instance per thread
• Example:
  int x;          /* private (the default; UPC has no private keyword) */
  shared int y;   /* shared, with affinity to thread 0 */
  x = 0; y = 0;
  x += 1; y += 1;
• After executing this code:
  • x will be 1 in all threads; y will be between 1 and THREADS
• Shared scalar variables are somewhat rare because:
  • they cannot be automatic (declared in a function) (Why not?)
UPC Pointers
• Pointers may point to shared or private variables
• Same syntax for use; just add a qualifier:
  shared int *sp;
  int *lp;
• sp is a pointer to an integer residing in the shared memory space.
• sp is called a shared pointer (somewhat sloppy).
[Figure: global address space with a shared variable x = 3; each thread holds pointers sp and lp, with sp pointing into the shared region and lp into the private region]
UPC Pointers
• May also have a pointer variable that is itself shared:
  shared int * shared sps;
  int * shared spl;  // does this make sense?
• The most common case is a private variable that points to a shared object (called a shared pointer)
[Figure: sps and spl reside in the shared region of the global address space]
Shared and Private Rules
• Default: types that are neither shared-qualified nor private-qualified are considered private.
  • This makes porting uniprocessor libraries easy
  • Makes porting shared memory code somewhat harder
• Casting pointers:
  • A pointer to a private variable may not be cast to a shared type.
  • If a pointer to a shared variable is cast to a pointer to a private object:
    • If the object has affinity with the casting thread, this is fine (see the sketch below).
    • If not, attempts to dereference that private pointer are undefined. (Some compilers may give better errors than others.)
  • Why?
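A minimal sketch of this affinity rule for casting, assuming a cyclically distributed array named data (the names are illustrative, not from the slides):

#include <upc_relaxed.h>
#include <stdio.h>

shared int data[THREADS];   /* one element per thread (cyclic) */

int main(void) {
    int *local;

    /* Legal: data[MYTHREAD] has affinity with this thread, so the
       cast from a shared pointer to a private pointer is well defined. */
    local = (int *)&data[MYTHREAD];
    *local = 10 * MYTHREAD;

    /* Undefined: this element (usually) has affinity with another thread,
       so dereferencing the resulting private pointer is an error. */
    /* local = (int *)&data[(MYTHREAD + 1) % THREADS]; *local = 0; */

    upc_barrier;
    if (MYTHREAD == 0)
        printf("data[0] = %d\n", data[0]);
    return 0;
}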
Shared Arrays
• Shared array elements are spread across the threads
  shared int x[THREADS];     /* One element per thread */
  shared int y[3][THREADS];  /* 3 elements per thread */
  shared int z[3*THREADS];   /* 3 elements per thread, cyclic */
• In the picture below:
  • Assume THREADS = 4
  • Elements with affinity to processor 0 are red
[Figure: layouts of x, y (blocked; of course, y is really a 2D array), and z (cyclic)]
Example: Vector Addition
• Questions about parallel vector addition:
  • How to lay out data (here it is cyclic)
  • Which processor does what (here it is "owner computes")

/* vadd.c */
#include <upc_relaxed.h>
#define N 100*THREADS

shared int v1[N], v2[N], sum[N];   /* cyclic layout */

void main() {
    int i;
    for (i = 0; i < N; i++)
        if (MYTHREAD == i % THREADS)   /* owner computes */
            sum[i] = v1[i] + v2[i];
}
Shared Pointers
• In the C tradition, arrays can be accessed through pointers
• Here is the vector addition example using pointers

#include <upc_relaxed.h>
#define N 100*THREADS

shared int v1[N], v2[N], sum[N];

void main() {
    int i;
    shared int *p1, *p2;   /* private pointers to shared data */
    p1 = v1; p2 = v2;
    for (i = 0; i < N; i++, p1++, p2++)
        if (i % THREADS == MYTHREAD)
            sum[i] = *p1 + *p2;
}
Work Sharing with upc_forall()
• Iterations are independent
• Each thread gets a bunch of iterations
• Simple C-like syntax and semantics:
  upc_forall(init; test; loop; affinity)
      statement;
• The affinity field distributes the work:
  • round robin
  • chunks of iterations
• Semantics are undefined if there are dependencies between iterations
  • The programmer has indicated that iterations are independent
Vector Addition with upc_forall
• The loop in vadd is common, so there is upc_forall:
  • The 4th argument is an int expression that gives the "affinity"
  • An iteration executes on a thread when:
    • affinity % THREADS is MYTHREAD

/* vadd.c */
#include <upc_relaxed.h>
#define N 100*THREADS

shared int v1[N], v2[N], sum[N];

void main() {
    int i;
    upc_forall(i = 0; i < N; i++; i)
        sum[i] = v1[i] + v2[i];
}
Synchronization
• No implicit synchronization among the threads
• UPC provides many synchronization mechanisms (see the sketch below):
  • Barriers (blocking)
    • upc_barrier
  • Split-phase barriers (non-blocking)
    • upc_notify
    • upc_wait
  • Locks
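A minimal sketch of the split-phase barrier, assuming some purely local work (the do_local_work helper is hypothetical) that can be overlapped while other threads reach the barrier:

#include <upc_relaxed.h>

shared int data[THREADS];

/* Hypothetical helper: work that touches only this thread's private data. */
static void do_local_work(void) { /* ... */ }

int main(void) {
    data[MYTHREAD] = MYTHREAD;

    upc_notify;        /* signal arrival without blocking */
    do_local_work();   /* overlap independent local work with the barrier */
    upc_wait;          /* block until every thread has notified */

    /* Writes issued before the notify on every thread are now visible. */
    return 0;
}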
UPC Vector Matrix Multiplication Code
• Here is one possible matrix-vector multiplication:

// vect_mat_mult.c
#include <upc_relaxed.h>

shared int a[THREADS][THREADS];
shared int b[THREADS], c[THREADS];

void main(void) {
    int i, j, l;
    upc_forall(i = 0; i < THREADS; i++; i) {
        c[i] = 0;
        for (l = 0; l < THREADS; l++)
            c[i] += a[i][l] * b[l];
    }
}
Data Distribution
[Figure: distribution of A (Thread 0, 1, 2), B (Th. 0, 1, 2), and C (Th. 0, 1, 2) across threads for A * B = C, using the default layout of the code above]
A Better Data Distribution
[Figure: distribution of A (Thread 0, 1, 2, one row block per thread), B (Th. 0, 1, 2), and C (Th. 0, 1, 2) for A * B = C, matching the blocked layout on the next slide]
Layouts in General
• All non-array shared objects have affinity with thread zero.
• Array layouts are controlled by layout specifiers:
  layout_specifier ::= null
                     | layout_specifier [ integer_expression ]
• The affinity of an array element is defined in terms of the block size (a compile-time constant) and THREADS (a runtime constant).
• Element i has affinity with thread (i / block_size) % THREADS (see the sketch below).
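A short sketch of how the block size determines affinity, assuming the standard upc_threadof library query and illustrative array names:

#include <upc_relaxed.h>
#include <stdio.h>

/* Default block size is 1 (cyclic): element i lives on thread i % THREADS. */
shared int cyc[10*THREADS];

/* Block size 10: element i has affinity with thread (i / 10) % THREADS. */
shared [10] int blk[10*THREADS];

int main(void) {
    if (MYTHREAD == 0)
        /* upc_threadof reports which thread an element has affinity with. */
        printf("blk[25] has affinity with thread %d\n",
               (int)upc_threadof(&blk[25]));
    return 0;
}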
Layout Terminology
• Notation is HPF, but the terminology is language-independent
• Assume there are 4 processors
[Figure: six 2D layouts of a matrix over 4 processors: (Block, *), (Cyclic, *), (*, Block), (Cyclic, Cyclic), (Block, Block), (Cyclic, Block)]
2D Array Layouts in UPC
• Array a1 has a row layout and array a2 has a block row layout:
  shared [m] int a1[n][m];
  shared [k*m] int a2[n][m];
• If (k + m) % THREADS == 0 then a3 has a row layout:
  shared int a3[n][m+k];
• To get more general HPF- and ScaLAPACK-style 2D blocked layouts, one needs to add dimensions.
• Assume r*c = THREADS:
  shared [b1][b2] int a5[m][n][r][c][b1][b2];
• or equivalently
  shared [b1*b2] int a5[m][n][r][c][b1][b2];
UPC Vector Matrix Multiplication Code
• Matrix-vector multiplication with a better layout:

// vect_mat_mult.c
#include <upc_relaxed.h>

shared [THREADS] int a[THREADS][THREADS];
shared int b[THREADS], c[THREADS];

void main(void) {
    int i, j, l;
    upc_forall(i = 0; i < THREADS; i++; i) {
        c[i] = 0;
        for (l = 0; l < THREADS; l++)
            c[i] += a[i][l] * b[l];
    }
}
Example: Matrix Multiplication in UPC
• Given two integer matrices A (NxP) and B (PxM)
• Compute C = A x B
• Entries C_ij in C are computed by the formula:

  $C_{ij} = \sum_{l=1}^{P} A_{il} \, B_{lj}$
Matrix Multiply in C
#include <stdlib.h>
#include <time.h>

#define N 4
#define P 4
#define M 4

int a[N][P], c[N][M];
int b[P][M];

void main(void) {
    int i, j, l;
    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            c[i][j] = 0;
            for (l = 0; l < P; l++)
                c[i][j] += a[i][l] * b[l][j];
        }
    }
}
Domain Decomposition for UPC
• Exploits locality in matrix multiplication:
  • A (N x P) is decomposed row-wise into blocks of size (N x P) / THREADS, as shown below
  • B (P x M) is decomposed column-wise into M / THREADS blocks, as shown below
• Note: N and M are assumed to be multiples of THREADS
[Figure: A split row-wise, with thread i owning elements i*N*P/THREADS .. (i+1)*N*P/THREADS - 1; B split column-wise, with thread i owning columns i*M/THREADS .. (i+1)*M/THREADS - 1]
UPC Matrix Multiplication Code
/* mat_mult_1.c */
#include <upc_relaxed.h>

#define N 4
#define P 4
#define M 4

shared [N*P/THREADS] int a[N][P], c[N][M];
// a and c are row-wise blocked shared matrices
shared [M/THREADS] int b[P][M];   // column-wise blocking

void main(void) {
    int i, j, l;   // private variables
    upc_forall(i = 0; i < N; i++; &c[i][0]) {
        for (j = 0; j < M; j++) {
            c[i][j] = 0;
            for (l = 0; l < P; l++)
                c[i][j] += a[i][l] * b[l][j];
        }
    }
}
Notes on the Matrix Multiplication Example
• The UPC code for the matrix multiplication is almost the same size as the sequential code
• Shared variable declarations include the keyword shared
• Making a private copy of matrix B in each thread might result in better performance, since many remote memory operations can be avoided
  • Can be done with the help of upc_memget
Memory Consistency in UPC
• The consistency model of shared memory accesses is controlled by designating accesses as strict, relaxed, or unqualified (the default).
• There are several ways of designating the ordering type (see the sketch below):
  • A type qualifier, strict or relaxed, can be used to affect all variables of that type.
  • Labels strict or relaxed can be used to control the accesses within a statement:
    strict : { x = y; z = y + 1; }
  • A strict or relaxed cast can be used to override the current label or type qualifier.
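A minimal sketch of the type-qualifier approach, using a strict flag for a simple producer/consumer handoff (assumes at least two threads; the variable names are illustrative, not from the slides):

#include <upc_relaxed.h>   /* relaxed is the default ordering in this file */
#include <stdio.h>

strict shared int ready;    /* strict flag; statically zero-initialized */
shared int payload;         /* relaxed by default */

int main(void) {
    if (MYTHREAD == 0) {
        payload = 42;       /* relaxed write */
        ready = 1;          /* strict write: payload completes before the flag flips */
    } else if (MYTHREAD == 1) {
        while (ready == 0)
            ;               /* strict read: spin until thread 0 sets the flag */
        printf("thread 1 sees payload = %d\n", payload);
    }
    upc_barrier;
    return 0;
}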
Matrix Multiplication with Block Copy
/* mat_mult_3.c */
#include <upc_relaxed.h>

/* N, P, M defined as before */
shared [N*P/THREADS] int a[N][P], c[N][M];   /* blocked shared matrices */
shared [M/THREADS] int b[P][M];
int b_local[P][M];   /* private copy of B on every thread */

void main(void) {
    int i, j, l;   // private variables
    upc_memget(b_local, b, P*M*sizeof(int));
    upc_forall(i = 0; i < N; i++; &c[i][0]) {
        for (j = 0; j < M; j++) {
            c[i][j] = 0;
            for (l = 0; l < P; l++)
                c[i][j] += a[i][l] * b_local[l][j];
        }
    }
}
Performance Tuning in a Global Address Space Language
Based on Split-C rather than UPC
An Irregular Problem: EM3D
Maxwell's Equations on an Unstructured 3D Mesh: Explicit Method
Irregular bipartite graph of varying degree (about 20) with weighted edges.
Basic operation is to subtract a weighted sum of neighboring values:
  for all E nodes
  for all H nodes
[Figure: bipartite graph of E and H nodes v1, v2 connected by weighted edges w1, w2]
EM3D: Uniprocessor Version
typedef struct node_t {
    double value;
    int edge_count;
    double *coeffs;        /* edge weights */
    double *(*values);     /* pointers to neighbors' values */
    struct node_t *next;
} node_t;

void all_compute_E() {
    node_t *n;
    int i;
    for (n = e_nodes; n; n = n->next) {
        for (i = 0; i < n->edge_count; i++)
            n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
    }
}

[Figure: an E node's value, coeffs, and values fields pointing at neighboring H nodes]

How would you optimize this for a uniprocessor?
– minimize cache misses by organizing the list so that neighboring nodes are visited in order
EM3D: Simple Parallel Version
Each processor has a list of local nodes.

typedef struct node_t {
    double value;
    int edge_count;
    double *coeffs;
    double *global (*values);   /* global pointers to neighbors' values */
    struct node_t *next;
} node_t;

void all_compute_e() {
    node_t *n;
    int i;
    for (n = e_nodes; n; n = n->next) {
        for (i = 0; i < n->edge_count; i++)
            n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
    }
    barrier();
}

[Figure: nodes v1, v2, v3 split between proc M and proc N, with some edges crossing the processor boundary]

How do you optimize this?
– Minimize remote edges
– Balance load across processors: C(p) = a*Nodes + b*Edges + c*Remotes
EM3D: Eliminate Redundant Accesses
void all_compute_e() {
    ghost_node_t *g;
    node_t *n;
    int i;
    /* First copy each remote value into a local ghost node. */
    for (g = h_ghost_nodes; g; g = g->next)
        g->value = *(g->rval);
    for (n = e_nodes; n; n = n->next) {
        for (i = 0; i < n->edge_count; i++)
            n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
    }
    barrier();
}

[Figure: nodes v1, v2, v3 on proc M and proc N; each remote neighbor is read once into a ghost node]
EM3D: Overlap Global Reads: GET
void all_compute_e() {
    ghost_node_t *g;
    node_t *n;
    int i;
    /* Split-C ":=" issues a non-blocking get for each remote value. */
    for (g = h_ghost_nodes; g; g = g->next)
        g->value := *(g->rval);
    sync();   /* wait for all outstanding gets to complete */
    for (n = e_nodes; n; n = n->next) {
        for (i = 0; i < n->edge_count; i++)
            n->value = n->value - *(n->values[i]) * (n->coeffs[i]);
    }
    barrier();
}

[Figure: nodes v1, v2, v3 on proc M and proc N, with ghost values fetched by overlapped gets]
Split-C: Performance Tuning on the CM5
• Tuning affects application performance
[Figure: µs per edge vs. % remote edges on the CM5 for em3d.simple, bundle.unopt, bundle.opt, em3d.get, and em3d.bulk]
Early UPC Experience at GMU
Codes Considered
• An early release of our synthetic UPC benchmark
• The UPC benchmarks focus on compiler implementation problems and consist of:
  • A synthetic benchmark, UPC_Synthetic
  • Some applications, UPC_Applications:
    • Matrix multiply
    • Sobel edge detection
    • Nqueens
  • Underway:
    • NAS Parallel Benchmarks (EP and MG in good shape)
    • SPLASH (Nbody ...)
    • UPC benchmarks at NERSC/LBNL: NAS CG
Opportunities for Performance Improvement
• Compiler optimizations
• Run-time system
• Hand tuning
Compiler Optimizations
• Absolutely the best option:
  • Ease of programming
  • Low performance overhead compared to run-time approaches
• UPC-specific optimizations are not there yet
List of Optimizations for UPC Code
1. Space privatization: use private pointers instead of shared pointers when dealing with local shared data, through casting and assignments (see the sketch below)
2. Block moves: use block copy instead of copying elements one by one with a loop, through string operations or structures
3. Latency hiding: for example, overlap remote accesses with local processing using split-phase barriers
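A minimal sketch of the first two optimizations, assuming a block-distributed array grid and an illustrative block size NB (names not from the slides):

#include <upc_relaxed.h>
#include <stdio.h>

#define NB 8   /* block size per thread, chosen only for illustration */

/* Blocked layout: elements i*NB .. (i+1)*NB - 1 have affinity with thread i. */
shared [NB] int grid[NB*THREADS];

int main(void) {
    int i, neighbor;
    int *local;
    int mine[NB];    /* private scratch buffer */

    /* 1. Space privatization: cast the start of this thread's block to a
       private pointer and update it locally, avoiding shared-pointer
       arithmetic on every access. */
    local = (int *)&grid[MYTHREAD * NB];
    for (i = 0; i < NB; i++)
        local[i] = MYTHREAD;

    upc_barrier;   /* make sure every block is written before reading */

    /* 2. Block move: fetch a neighbor's whole block with one upc_memget
       instead of NB individual remote reads. */
    neighbor = (MYTHREAD + 1) % THREADS;
    upc_memget(mine, &grid[neighbor * NB], NB * sizeof(int));

    printf("thread %d copied neighbor %d's block, first element %d\n",
           MYTHREAD, neighbor, mine[0]);
    return 0;
}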
UPC versus MPI for Edge detection
[Figure: (a) execution time and (b) speedup for edge detection (N=512), comparing UPC with optimizations O1+O2 against MPI for up to 20 processors; a linear-speedup line is shown for reference]
UPC versus MPI for Matrix Multiplication
[Figure: (a) execution time and (b) speedup for matrix multiplication, comparing UPC with optimizations O1+O2 against MPI for up to 20 processors; a linear-speedup line is shown for reference]
UPC History and Status
Hardware Platforms
• UPC implementations are available for:
  • Cray T3D/E (V3.1.9)
  • Compaq AlphaServer SC (V1.51)
  • SGI
• Ongoing and future implementations for:
  • HP
  • Sun multiprocessors
  • Cray SV-2
  • Beowulf clusters
  • IBM SPs
Compiling and Running on a Cray
• Cray:
  • Set your path (at the end of .login, for example):
    set path = ($path /u/c/ciancu/upc/upc3.2/bin)
  • Compile with a fixed number (4) of threads:
    upc -O2 -fthreads-4 -o vadd vadd.c
  • To run:
    ./vadd
The Short Story of UPC
• Start with C, the other proven language besides FORTRAN
• Keep all powerful C concepts and features
• Add parallelism; learn from Split-C, AC, PCP:
  • Split-C: designed for MPPs, based on Active Messages
  • AC: vector language for the CM5, then the T3D/E
  • PCP: shared memory language, extended with global pointers
• Integrate user community experience and experimental performance observations
• Integrate developers' expertise from vendors, government, and academia
=> UPC!
Design Philosophy
• Similar to the C language philosophy:
  • Programmers are clever and careful
  • Programmers can get close to the hardware
    • to get performance, but
    • can get in trouble
  • Concise and efficient syntax
• Common and familiar syntax and semantics for parallel C, with simple extensions to ANSI C
• Allow easy implementations onto different architectures
• Provide high performance at the:
  • processor level
  • communication layer
History
• Initial Tech. Report from IDA in collaboration with LLNL and UCB in May 1999 (led by IDA).
• UPC consortium of government, academia, and HPC vendors coordinated by GMU, IDA, and NSA.
• The current participants are: ARSC, Compaq, CSC, Cray Inc., Etnus, GMU, HP, IDA CCS, Intrepid Technologies, LBNL, LLNL, MTU, NSA, SGI, Sun Microsystems, UCB, US DOD
• Meetings:
  • Spring 1997: researchers from AC, Split-C, and PCP
  • May 2000 in Bowie, Maryland: first consortium meeting
  • November 2000 in Dallas: second consortium meeting
Documentation
• On the Web at: http://upc.gwu.edu/
  • Specification v1.0 completed February of 2001
  • Benchmark, UPC_Bench, v1.0pre1, released by GMU
  • Testing suite released by GMU, v1.0pre1
  • UPC course continuously updated and offered at NSA, in the UK, at NASA GSFC, ...
• UPC book in progress
Concluding Remarks
Notes on Usability
• UPC is easy to program in for C writers, easier than current alternatives
• UPC and the distributed shared memory programming model combine the ease of shared memory programming with the flexibility of message passing
• Hand tuning may sometimes make parts of the code look a little like message passing, when it becomes efficient to make local copies of data.
  • This is not likely to be the case for the majority of the data, so the code will remain mostly UPC-like and not as complicated as message passing
Notes on Performance
• UPC exhibits very little overhead compared with MPI for problems that are embarrassingly parallel; no tuning is necessary.
• For other problems, compiler optimizations are happening but are not fully there yet
• With hand tuning, UPC compared favorably with MPI on the Compaq AlphaServer for most applications considered
• Shared objects seem to add very large overhead, especially when the code is not tuned. So far, tuning has overcome that for the applications used
General Remarks
• As compiler optimizations effectively exploit the opportunities for improvement, there will be no need for most hand tuning
  • Should this happen, UPC and DSMP are likely to become the way to go
• At the abstract level, DSMP is more powerful than the MP, DP, and SM models. The challenges are at the implementation level and might be alleviated with clever implementations and/or simple O.S./architecture support
• Even if UPC does not outperform MPI in general (although it should), UPC will provide an alternative to MPI which for many problems will be easier to use and will provide shorter turnaround