Programming Models for Multi-Cores
Ana Lucia Varbanescu
TUDelft / Vrije Universiteit Amsterdam
with acknowledgements to
Maik Nijhuis @ VU
Xavier Martorell @ UPC, Rosa Badia @ BSC
Outline
An introduction
Programming the Cell/B.E.
Available models
…can we compare them?!
More processors, more models…?
CUDA, Brook, TBB, Ct, Sun Studio, …
…or a single standard one?
OpenCL standard = the solution?
Conclusions
An introduction
The Problem
Cell/B.E. = High performance
Cell/B.E. != Programmability
Is there a way to match the two?
Cell/B.E.
1 x PPE 64-bit PowerPC (L1: 32KB I$ + 32 KB D$; L2: 512 KB)
8 x SPE cores (LS: 256KB, SIMD machines)
Hybrid memory model
Cell blades (QS20/21): 2xCell / PS3: 1xCell (6 SPEs only)
Thread-based model, push/pull data
Thread scheduling by user
Five layers of parallelism:
Task parallelism (MPMD)
Data parallelism (SPMD)
Data streaming parallelism (DMA double buffering)
Vector parallelism (SIMD – up to 16-way; sketched below)
Pipeline parallelism (dual-pipelined SPEs)
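As a concrete illustration of the vector layer, a minimal SPE sketch, assuming the IBM Cell SDK's spu_intrinsics.h; saxpy_simd is an illustrative name, not SDK code:

    #include <spu_intrinsics.h>

    /* y = a*x + y on an SPE: each vector float holds four 32-bit lanes,
       so one spu_madd performs four multiply-adds at once. */
    void saxpy_simd(vector float *x, vector float *y, float a, int n_vec)
    {
        vector float va = spu_splats(a);      /* replicate a into all 4 lanes */
        for (int i = 0; i < n_vec; i++)
            y[i] = spu_madd(va, x[i], y[i]);  /* 4 elements per iteration */
    }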
Programming the Cell/B.E.
A view from the application:
High-level parallelization => application task-graph
Mapping/Scheduling => mapped graph
In-core optimizations => optimized code for each core
A high-level programming model should “capture” all three aspects of Cell applications!
[Sidebar diagram: High-level → Mapping → Core-level]
Expressing the task graph
Task definition
A task is a tuple: <Inputs, Outputs, Computation[, Data-Par]> (see the sketch below)
Task interconnections
Express top level application parallelism and data dependencies
Task synchronization
Allow for barriers and other mechanisms, external from the tasks
Task composition
Tasks should be able to split/merge with other tasks.
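A minimal C rendering of the task tuple above could look as follows; this is a hypothetical illustration, not an API from any of the models discussed:

    /* One task = <Inputs, Outputs, Computation[, Data-Par]> */
    typedef struct task {
        void **inputs;                            /* Inputs */
        void **outputs;                           /* Outputs */
        void (*compute)(void **in, void **out);   /* Computation (sequential) */
        int data_par_width;                       /* optional Data-Par: SPMD width, 1 = none */
    } task_t;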
Mapping/scheduling
Task-graph “expansion” (Auto)
Data parallelism and synchronization are transformed into nodes and edges
Application mapping (User-aid)
All potential mappings should be considered
Mapping optimizations (User-aid)
Merge/split tasks to fit the target core and minimize communication
Scheduling (User)
Establish how to deal with contention at the core level
Core-level
Computation (User)
Allow the user to write the computation code (sequential)
Core optimizations (User/Auto)
Per-core optimizations (different on PPE and SPEs)
Memory access (Auto)
Hide explicit DMA
Optimize DMA (Auto)
Overlap computation with communication (sketched below)
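A minimal double-buffering sketch, assuming the Cell SDK's spu_mfcio.h DMA calls; CHUNK, stream_in and process() are illustrative names:

    #include <spu_mfcio.h>

    #define CHUNK 4096
    extern void process(char *buf, int n);  /* hypothetical compute kernel */

    static char buf[2][CHUNK] __attribute__((aligned(128)));

    void stream_in(unsigned long long ea, int nchunks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);       /* prefetch first chunk */
        for (int i = 0; i < nchunks; i++) {
            int next = cur ^ 1;
            if (i + 1 < nchunks)                        /* start next DMA early */
                mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                        CHUNK, next, 0, 0);
            mfc_write_tag_mask(1 << cur);               /* wait only for cur */
            mfc_read_tag_status_all();
            process(buf[cur], CHUNK);                   /* compute overlaps the DMA */
            cur = next;
        }
    }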
Extras
Performance estimation
Application performance should be roughly predictable from the task graph
Warnings and hints
Better warnings and hints to replace the standard SDK messages
Available Cell/B.E. programming models
SDK-based models
IDL, ALF
Code-reuse models
MPI (micro-tasks), OpenMP
CellSS
Abstract models
Sequoia
Charm++ and the Offload API
SP@CE
Industry
PeakStream, RapidMind, the MultiCore Framework
Other approaches
MultiGrain Parallelism Scheduling, BlockLib, Sieve++
IDL and the Function-Offload Model
Offloads computation-intensive tasks onto the SPEs
Programmer provides:
Sequential code to run on PPE
SPE implementations for offloaded functions
IDL specification for function behaviour
Dynamic scheduling, based on distributed SPE queues
Accelerated Library Framework (ALF)
SPMD applications on a host-accelerator platform
Programmer provides:
Accelerator libraries - collections of accelerated code
Application usage of the accelerator libraries
Runtime scheduling
MPI micro-tasks
MPI front-end on the Cell/B.E.
Programmer provides:
MPI application
Preprocessor generates application graph with basic tasks
Basic tasks are merged together such that the graph becomes series-parallel (SP)
The SP graph is mapped automatically
Core-level communication optimizations are automatic
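For concreteness, the plain MPI code the preprocessor consumes could look like this (standard MPI C; the ring-style exchange is an illustrative example, each rank becoming a basic task and each message an edge in the generated graph):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, left, right;
        double halo_in = 0.0, halo_out = 1.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        left  = (rank + size - 1) % size;
        right = (rank + 1) % size;
        /* this send/recv pair becomes a dependency edge between basic tasks */
        MPI_Sendrecv(&halo_out, 1, MPI_DOUBLE, right, 0,
                     &halo_in,  1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Finalize();
        return 0;
    }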
OpenMP
Based on pragmas
Enables code re-use
Programmer provides:
OpenMP application
Core-level optimizations
DMA optimizations
Mapping and scheduling: automated
Most work on the compiler side
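A minimal sketch of the input such a compiler expects; this is plain, standard OpenMP, with nothing Cell-specific in the source:

    #include <omp.h>

    void scale(float *a, float s, int n)
    {
        /* the Cell compiler outlines this loop body into SPE code
           and generates the DMA traffic behind the scenes */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] *= s;
    }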
Cell SuperScalar (CellSS)
Very good for quick porting of applications on the Cell/B.E.
Programmer provides:
Sequential C application
Pragmas to mark the functions to be offloaded
Additional data distribution information
Based on a compiler and a run-time system
The compiler splits the annotated application into a PPE application and an SPE application
The runtime system maintains a dynamic data dependency graph over the active tasks, updating it each time a task starts or ends
Dynamic scheduling, based on this runtime data dependency graph
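A sketch in the annotation style of the CellSs publications; the exact pragma and array-shape syntax may differ between releases, so treat this as an approximation rather than the definitive API:

    #define N 4096

    /* the pragma marks scale_task() as an offloadable task; the runtime
       tracks a and b to build and update the data dependency graph */
    #pragma css task input(a[N]) output(b[N])
    void scale_task(float *a, float *b);

    void run(float *a, float *b, float *c)
    {
        scale_task(a, b);   /* each call becomes a node in the graph */
        scale_task(b, c);   /* ordered after the first by the b dependency */
    }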
Sequoia
High-level abstract model, suitable for divide-and-conquer applications
Uses memory locality as the primary parallelization criterion
Application = a hierarchy of parameterized, recursively decomposed tasks
Tasks run in isolation (data locality)
Programmer provides:
Application hierarchical graph
A mapping of the graph on the platform
(Optimized) Code for the leaf-nodes
A flexible environment for tuning and testing application performance
SP@CE
Dedicated to streaming applications
An application is a collection of kernels that communicate only by data streaming
Programmer provides:
Application streaming graph (XML)
(Library of) Optimized kernels for the SPEs
Dynamic scheduling, based on a centralized job-queue
A run-time system on the SPEs hides (some of) the communication overhead
Charm++ and the Offload API
An application = a collection of chares
Communicate through messages
Created and/or destroyed at runtime
A chare has a list of work requests to run on SPE
PPE: uses the Offload API to manage the work requests (data flow, execution, completion)
SPE: a small runtime system for local management and optimizations
Programmer provides:
Charm++ application
Work requests and their SPE code
RapidMind
Based on “SPMD streaming”
Tasks are executed on parallelized streams of data
A kernel (“program”) is a computation on elements of a vector
An application is a combination of regular code and RapidMind code => the compiler translates it into PPE code and SPE code
Programmer provides:
C++ application
Computation kernels inside the application
Kernels can execute asynchronously => achieve task-parallelism
MultiCore Framework SDK (Mercury)
A master-worker model
focused on data parallelism and data distributions
An application = manager (on PPE) and workers (on SPEs)
Data communication is based on:
virtual channels: between manager and worker(s)
data objects: to specify data granularity and distribution
The elements read and written may differ at the two ends of a channel
Programmer provides:
C code for the kernels
The channels interconnections via read/write ops
Data distribution objects for each channel
No parallelization support, no core optimizations, no application-level design.
Brief overview
Features - revisited
How to compare performance?
Implement one application from scratch
Impractical and very time-consuming
Use an existing benchmark
Matrix multiplication is available
Performance
See examples …
Are the results relevant?
Only partially!
MMUL is NOT a good benchmark for high-level programming models
The results mainly reveal the success of the low-level optimizations
The implementations are VERY different
Hard to measure computation only
Data distribution issues are addressed very differently
Overall, a better approach for performance comparison is needed!
Benchmark application
Set of metrics
Still …
Low-level optimizations are not part of a programming model’s targets => they can/should be designed separately and heavily reused
The performance overhead induced by designing and/or implementing in a high-level model decreases with the size of the application
The programming effort spent on SPE optimizations adds a constant factor to the overall implementation effort, independent of the chosen programming model
Usability
The Answers [1/2]
High-level programming models cover enough features to support application design and implementation at all levels
Low-level optimizations and high-level algorithm parallelization remain difficult tasks for the programmer
No single Cell/B.E. programming model can address all application types
Feature coverage: High-level > 90% | Mapping 0–100% | Core-level > 50%
The Answers [2/2]
Alleviate the programmability issue: 60%
Preserve the high Cell/B.E. performance: 90%
Are they easy to use? 10-90%
Do they allow for automation? 50%
Is there an ideal one? NO
GPU Models [1/2]
GPGPU used to be fancy
OpenCL
Cg
RapidMind
GPU Models [2/2]
NVIDIA GPUs
CUDA is an original HW-SW codesign approach
Extremely popular
Considered easy to use
ATI/AMD GPUs
Originally Brook
Currently ATI Stream SDK
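As an illustration of why CUDA is considered easy to use, a minimal kernel and launch (an assumed example, not from the talk):

    /* SPMD kernel: one thread per array element */
    __global__ void scale(float *a, float s, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            a[i] *= s;
    }

    /* host side: scale<<<(n + 255) / 256, 256>>>(dev_a, 2.0f, n); */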
OpenCL [1/4]
Currently up and running for:
AMD/ATI, IBM, NVIDIA, Apple
Other members of the Khronos consortium to follow
ARM, Intel [?]
See examples …
OpenCL [2/4]
Language Specification
C-based cross-platform programming interface
Subset of ISO C99 with language extensions - familiar to developers
Online or offline compilation and build of compute kernel executables
Platform Layer API
A hardware abstraction layer over diverse computational resources
Query, select and initialize compute devices
Create compute contexts and work-queues
Runtime API
Execute compute kernels
Manage scheduling, compute, and memory resources
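A minimal sketch of the platform-layer steps listed above, using the standard OpenCL 1.x C API (error handling omitted; make_context is an illustrative wrapper):

    #include <CL/cl.h>

    cl_context make_context(cl_device_id *dev, cl_command_queue *queue)
    {
        cl_platform_id platform;
        cl_context ctx;
        clGetPlatformIDs(1, &platform, NULL);                        /* query */
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, dev, NULL);  /* select */
        ctx = clCreateContext(NULL, 1, dev, NULL, NULL, NULL);       /* context */
        *queue = clCreateCommandQueue(ctx, *dev, 0, NULL);           /* work-queue */
        return ctx;
    }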
OpenCL [3/4] – memory model
Multi-level memory model
Private memory: visible only to the individual work-item
Global memory: visible to all compute units on the device
Depending on the HW, memory spaces can be collapsed together
4 memory spaces:
Private memory: a single work-item (think registers)
Local memory: the work-items in a work-group
Constant memory: stores constant data for read-only access
Global memory: used by all the compute units on the device
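A minimal OpenCL C kernel sketch showing all four address-space qualifiers; the partial-dot-product pattern is illustrative:

    __kernel void dot_chunk(__global const float *a,
                            __global const float *b,
                            __constant int *n,          /* constant memory */
                            __local float *scratch,     /* local: per work-group */
                            __global float *partial)
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        float p = (gid < *n) ? a[gid] * b[gid] : 0.0f;  /* p is private memory */
        scratch[lid] = p;
        barrier(CLK_LOCAL_MEM_FENCE);                   /* work-group sync */
        if (lid == 0) {
            float s = 0.0f;
            for (int i = 0; i < (int)get_local_size(0); i++)
                s += scratch[i];
            partial[get_group_id(0)] = s;               /* one result per group */
        }
    }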
OpenCL [4/4] – execution model
Execution model
Compute kernels can be thought of as either data-parallel (a natural fit for GPUs) or task-parallel (well-matched to the architecture of CPUs)
A compute kernel is the basic unit of executable code and can be thought of as similar to a C function
Kernel execution can be in-order or out-of-order
Events allow the developer to check on the status of runtime requests
The execution domain of a kernel
An N-dimensional computation domain
Each element in the execution domain is a work-item
Work-items can be clustered into work-groups for synchronization and communication
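A minimal host-side sketch of this execution model, again standard OpenCL 1.x C API; the kernel, buffer, and work sizes are illustrative:

    #include <CL/cl.h>

    void launch(cl_command_queue queue, cl_kernel kernel, cl_mem buf, cl_int n)
    {
        size_t global = 1024, local = 256;   /* 1-D domain: work-items / group size */
        cl_event done;
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(kernel, 1, sizeof(cl_int), &n);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                               &global, &local, 0, NULL, &done);
        clWaitForEvents(1, &done);           /* the event reports request status */
    }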
Conclusions
A multitude of programming models
Abundant for the Cell/B.E., due to the original lack of high-level programming support
Fewer for GPUs, thanks to CUDA
Simple programming models are key to platform adoption
CUDA
Essential features are:
Tackling *all* parallelism layers of a platform
Both automagically and with user-intervention
Portability
Ease-of-use, avoiding a very steep learning curve (C-based works)
(Control over) Performance
Most of the time, efficiency
Take home messages
Application parallelization remains the programmer's task
Programming models should facilitate quick implementation and
evaluation
Programming models are hard to compare
Application-specific or platform-specific
Often user-specific
Low portability is considered worse than performance drops
Performance trade-offs are smaller than expected
OpenCL’s portability is responsible (so far) for its appeal
Thank you!
Questions?
[email protected]
[email protected]
http://www.pds.ewi.tudelft.nl/~varbanescu