Heterogeneous Computing
in Charm++
David Kunzman
Motivations
• Performance and Popularity of Accelerators
– Our work currently focuses on Cell (and Larrabee)
– Difficult to program accelerators
• Architecture specific code (not portable)
• Many asynchronous events (data movement, multiple cores)
• Heterogeneous Clusters Exist Already
– Roadrunner at LANL (Opterons and Cells)
– Lincoln at NCSA (Xeons and GPUs)
– MariCel at BSC (Powers and Cells)
Goals
• Portability of code
– Code should be portable between systems with and without accelerators
– Across homogeneous and heterogeneous clusters
– Reduce programmer effort
• Allow various pieces of code to be written independently
– Pieces of code share the accelerator(s)
– Scheduled by the runtime system automatically
• Naturally extend the existing Charm++ model
– Same programming model for all hosts and accelerators
Approach
• Make entry methods portable between host and accelerator cores
– Allows the programmer to write entry method code once and use the same code for all cores
– Still make use of architecture/core specific features
• Take advantage of the clear communication boundaries in Charm++
– Almost all data is encapsulated within chare objects
– Data is passed between chare objects by invoking entry methods (a minimal sketch follows this slide)
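For context, a minimal sketch of the Charm++ pattern this relies on: all state lives inside a chare object, and other objects can only reach it by invoking entry methods through a proxy. The class, member, and proxy names here are illustrative, not taken from the talk.

  // Interface (.ci) file -- illustrative names
  array [1D] Accumulator {
    entry Accumulator();
    entry void accum(int inArrayLen, float inArray[inArrayLen]);
  };

  // C++ source -- the chare's data is private to the object
  class Accumulator : public CBase_Accumulator {
    int localArrayLen;
    float *localArray;     // only reachable via entry methods
  public:
    Accumulator() { /* allocate and initialize localArray */ }
    void accum(int inArrayLen, float *inArray);   // body as on later slides
  };

  // Elsewhere: invoke through a proxy; the runtime moves the message data
  accumProxy[idx].accum(someFloatArray_len, someFloatArray_ptr);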
Extending Charm++
• SIMD Instruction Abstraction
– To reach any significant fraction of peak, must use
SIMD instructions on modern cores
– Abstract SIMD instructions so code is portable
• Accelerated Entry Methods
– May execute on accelerators
– Essentially a standard entry method split into two
stages
• Function body (accelerator or host; limited)
• Callback function (host; not limited)
SIMD Instruction Abstraction
• Abstract SIMD instructions supported by multiple architectures
– Currently adding support for: SSE (x86), AltiVec/VMX (PowerPC; PPE), SIMD instructions on SPEs, and Larrabee
– Generic C implementation when no direct architectural support is present (sketch below)
– Types: vecf, veclf, veci, ...
– Operations: vaddf, vmulf, vsqrtf, ...
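A rough sketch of what the generic C fallback for one type and one operation could look like when no native SIMD support exists; the vector width and layout are assumed here, not specified in the talk. On SSE/AltiVec/SPE builds the same names would map to the native vector type and intrinsic instead.

  // Hypothetical generic-C fallback for vecf / vaddf (names from the slide)
  typedef struct { float v[4]; } vecf;        // assumed 4 floats per vector
  enum { vecf_numElems = 4 };

  static inline vecf vaddf(vecf a, vecf b) {
    vecf r;
    for (int i = 0; i < vecf_numElems; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
  }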
Example Entry Method
entry void accum(int inArrayLen, float inArray[inArrayLen]) {
  if (inArrayLen != localArrayLen) return;
  for (int i = 0; i < inArrayLen; ++i)
    localArray[i] = localArray[i] + inArray[i];
};
To Invoke: myChareObj.accum(someFloatArray_len, someFloatArray_ptr);
Example Entry Method w/ SIMD
entry void accum(int inArrayLen, align(sizeof(vecf)) float inArray[inArrayLen]) {
  if (inArrayLen != localArrayLen) return;
  vecf *inArrayVec = (vecf*)inArray;
  vecf *localArrayVec = (vecf*)localArray;
  int arrayVecLen = inArrayLen / vecf_numElems;
  for (int i = 0; i < arrayVecLen; ++i)
    localArrayVec[i] = vaddf(localArrayVec[i], inArrayVec[i]);
  for (int i = arrayVecLen * vecf_numElems; i < inArrayLen; ++i)
    localArray[i] = localArray[i] + inArray[i];
};
To Invoke: myChareObj.accum(someFloatArray_len, someFloatArray_ptr);
Accel Entry Method Structure
Standard:
  Interface File:
    entry void entryName( …passed parameters… );
  Source File:
    void ChareClass::entryName( …passed parameters… )
    { … function body … }
vs.
Accelerated:
  Interface File:
    entry [accel] void entryName( …passed parameters… )
      [ …local parameters… ]
      { … function body … }
      callback_member_function;
Invocation (both): chareObj.entryName(… passed parameters …)
Example Accelerated Entry Method
entry [accel] void accum(int inArrayLen, align(sizeof(vecf)) float inArray[inArrayLen])
  [ readOnly : int localArrayLen <impl_obj->localArrayLen>,
    readWrite : float localArray[localArrayLen] <impl_obj->localArray> ] {
  if (inArrayLen != localArrayLen) return;
  vecf *inArrayVec = (vecf*)inArray;
  vecf *localArrayVec = (vecf*)localArray;
  int arrayVecLen = inArrayLen / vecf_numElems;
  for (int i = 0; i < arrayVecLen; ++i)
    localArrayVec[i] = vaddf(localArrayVec[i], inArrayVec[i]);
  for (int i = arrayVecLen * vecf_numElems; i < inArrayLen; ++i)
    localArray[i] = localArray[i] + inArray[i];
} accum_callback;
To Invoke: myChareObj.accum(someFloatArray_len, someFloatArray_ptr);
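For completeness, a sketch of what the host-side callback half might look like in the source file. The signature, bookkeeping members, and follow-on step are illustrative assumptions, not taken from the talk; the point is only that this half runs on the host and is not limited.

  // Runs on the host once the accelerated body above completes, so it may
  // freely send messages, invoke other entry methods, etc.
  void ChareClass::accum_callback() {
    if (++accumsReceived == accumsExpected)   // hypothetical bookkeeping
      mainProxy.accumDone(thisIndex);         // hypothetical next step
  }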
Timeline of Events
• Runtime system…
– Directs data movement (messages & DMAs)
– Schedules accelerated entry methods and callbacks
Communication Overlap
• Data movement automatically overlapped with accelerated entry method execution on SPEs and entry method execution on PPE
Handling Host Core Differences
• Automatic modification of application data at communication boundaries
– Structure of data is known via parameters and Pack-UnPack (PUP) routines (example below)
– During packing process, add information on how the data is encoded
– During unpacking, if needed, modify data to match local architecture
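A minimal sketch of a Charm++ PUP routine, with illustrative class and field names. Because the routine walks every field, the runtime sees the structure of the data when packing and unpacking, which is what allows it to re-encode fields when the sender and receiver architectures differ.

  #include "pup.h"   // Charm++ PUP framework

  class Patch /* : public CBase_Patch */ {
    int numParticles;
    float *positions;   // 3 * numParticles floats (illustrative layout)
  public:
    void pup(PUP::er &p) {
      p | numParticles;
      if (p.isUnpacking()) positions = new float[3 * numParticles];
      PUParray(p, positions, 3 * numParticles);
    }
  };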
Molecular Dynamics (MD) Code
• Based on object interaction seen in NAMD’s nonbonded electrostatic force computation (simplified)
– Coulomb’s Law
– Single precision floating-point
• Particles evenly divided between patch objects
– ~92K particles in 144 patches (similar to ApoA1 benchmark)
• Compute objects (self and pairwise) compute forces for patch objects (kernel sketch below)
• Patches integrate combined force data and update particle positions
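An illustrative sketch of the pairwise force kernel described above, not the benchmark's actual code. For each particle pair, Coulomb's law gives F = k * q_i * q_j / r^2, applied along the separation vector; the constant, array layout, and function name are assumptions.

  #include <cmath>

  static const float COULOMB_K = 332.0636f;  // assumed units: kcal*A/(mol*e^2)

  // x* hold packed x,y,z triples; f* accumulate forces in the same layout
  void pairCompute(int nA, const float *xA, const float *qA, float *fA,
                   int nB, const float *xB, const float *qB, float *fB) {
    for (int i = 0; i < nA; ++i) {
      for (int j = 0; j < nB; ++j) {
        float dx = xA[3*i+0] - xB[3*j+0];
        float dy = xA[3*i+1] - xB[3*j+1];
        float dz = xA[3*i+2] - xB[3*j+2];
        float invR = 1.0f / sqrtf(dx*dx + dy*dy + dz*dz);
        float f = COULOMB_K * qA[i] * qB[j] * invR * invR * invR;  // F / r
        fA[3*i+0] += f*dx; fA[3*i+1] += f*dy; fA[3*i+2] += f*dz;
        fB[3*j+0] -= f*dx; fB[3*j+1] -= f*dy; fB[3*j+2] -= f*dz;
      }
    }
  }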
MD Code Results
• Executing on 2 Xeon cores, 8 PPEs, and 56 SPEs
– 3 ISAs, 3 SIMD instruction extensions, and 2 memory structures
– Better scaling is achieved when Xeons are present
– 331.1 GFlop/s (19.82% of peak; serial code limited to 27.7% of peak on one SPE, assuming that SPE has an infinite local store)
Visualizing MD Code Execution
Summary
• Support for accelerators and heterogeneous execution in Charm++
– Programming model and runtime system changes
• Accelerated entry methods
• SIMD instruction abstraction
• Automatic modification of application data
• Visualization support
– Support
• Currently supports Cell
• Adding support for Larrabee
• Clusters where host cores have different architectures
Future Work
• Dynamic measurement-based load balancing on heterogeneous systems
• Increase support for more accelerators
– In the process of adding support for Larrabee
– Increasing support for existing abstractions and/or developing new abstractions
Questions