
Programming Models for Multi-Cores
Ana Lucia Varbanescu
TU Delft / Vrije Universiteit Amsterdam
with acknowledgements to
Maik Nijhuis @ VU
Xavier Martorell @ UPC, Rosa Badia @ BSC
Outline
 An introduction
 Programming the Cell/B.E.
   Available models
   … can we compare them?!
 More processors, more models…?
   CUDA, Brook, TBB, Ct, Sun Studio, …
 … or a single standard one?
   The OpenCL standard = the solution?
 Conclusions
An introduction
The Problem
 Cell/B.E. = High performance
 Cell/B.E. != Programmability
 Is there a way to match the two?
Cell/B.E.
 1 x PPE: 64-bit PowerPC (L1: 32 KB I$ + 32 KB D$; L2: 512 KB)
 8 x SPE cores (LS: 256 KB; SIMD machines)
 Hybrid memory model
 Cell blades (QS20/21): 2 x Cell / PS3: 1 x Cell (only 6 SPEs)
 Thread-based model, push/pull data
   Thread scheduling by the user
 Five layers of parallelism (a double-buffering sketch follows this list):
   Task parallelism (MPMD)
   Data parallelism (SPMD)
   Data-streaming parallelism (DMA double buffering)
   Vector parallelism (SIMD, up to 16-way)
   Pipeline parallelism (dual-pipelined SPEs)
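
To make the data-streaming layer concrete: a minimal SPE-side double-buffering sketch, assuming the Cell SDK's spu_mfcio.h MFC interface. CHUNK, process() and the compute() hook are illustrative names, not from the deck.

    /* Minimal SPE double-buffering sketch (assumes the Cell SDK's spu_mfcio.h).
     * While compute() works on one buffer, the MFC streams the next chunk in. */
    #include <stdint.h>
    #include <spu_mfcio.h>

    #define CHUNK 4096   /* bytes per DMA transfer (illustrative) */

    volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void compute(volatile char *p, int n);   /* user kernel (hypothetical) */

    void process(uint64_t ea, int nchunks) {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);      /* prefetch first chunk   */
        for (int i = 0; i < nchunks; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nchunks)                      /* start next DMA early   */
                mfc_get(buf[nxt], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);
            mfc_write_tag_mask(1 << cur);             /* wait for current only  */
            mfc_read_tag_status_all();
            compute(buf[cur], CHUNK);                 /* overlaps with next DMA */
            cur = nxt;
        }
    }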
Programming the Cell/B.E.
 A view from the application:
   High-level parallelization => application task graph
   Mapping/scheduling => mapped graph
   In-core optimizations => optimized code for each core
 A high-level programming model should “capture” all three aspects of Cell applications!
Expressing the task graph
 Task definition
   A task is a tuple: <Inputs, Outputs, Computation[, Data-Par]> (a possible C encoding follows this list)
 Task interconnections
   Express top-level application parallelism and data dependencies
 Task synchronization
   Allow for barriers and other mechanisms, external to the tasks
 Task composition
   Tasks should be able to split/merge with other tasks
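
One possible C encoding of the task tuple, purely illustrative; none of the surveyed models uses these exact names.

    #include <stddef.h>

    typedef struct buffer { void *data; size_t size; } buffer_t;

    typedef struct task {
        buffer_t  *inputs;   int n_inputs;     /* Inputs                   */
        buffer_t  *outputs;  int n_outputs;    /* Outputs                  */
        void     (*compute)(struct task *);    /* Computation              */
        int        data_par;                   /* optional data-par width  */
        struct task **succ;  int n_succ;       /* edges: interconnections  */
    } task_t;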
Mapping/scheduling
 Task-graph “expansion” (Auto)
   Data parallelism and synchronization are transformed into nodes and edges
 Application mapping (User-aided)
   All potential mappings should be considered
 Mapping optimizations (User-aided)
   Merge/split tasks to fit the target core and minimize communication
 Scheduling (User)
   Establish how to deal with contention at the core level
Core-level
 Computation (User)
   Allow the user to write the (sequential) computation code
 Core optimizations (User/Auto)
   Per-core optimizations (different on the PPE and the SPEs)
 Memory access (Auto)
   Hide explicit DMA
 DMA optimization (Auto)
   Overlap computation with communication
Extras
 Performance estimation
   Application performance should be roughly predictable from the task graph
 Warnings and hints
   Better warnings and hints to replace the standard SDK messages
Available Cell/B.E. programming models
 SDK-based models
   IDL, ALF
 Code-reuse models
   MPI (micro-tasks), OpenMP
   CellSS
 Abstract models
   Sequoia
   Charm++ and the Offload API
   SP@CE
 Industry
   PeakStream, RapidMind, the MultiCore Framework
 Other approaches
   MultiGrain Parallelism Scheduling, BlockLib, Sieve++
IDL and the Function-Offload Model
 Offloads computation-intensive tasks onto the SPEs
 Programmer provides:
   Sequential code to run on the PPE
   SPE implementations of the offloaded functions
   An IDL specification of each function's behaviour
 Dynamic scheduling, based on distributed SPE queues
Accelerated Library Framework (ALF)
 SPMD applications on a host-accelerator platform
 Programmer provides:
   Accelerator libraries: collections of accelerated code
   Application usage of the accelerator libraries
 Runtime scheduling
MPI micro-tasks
 An MPI front-end for the Cell/B.E.
 Programmer provides:
   An MPI application (a minimal sketch follows)
 The preprocessor generates an application graph of basic tasks
 Basic tasks are merged together such that the graph is SP (series-parallel)
 The SP graph is mapped automatically
 Core-level communication optimizations are automatic
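
For context, a sketch of the kind of plain MPI program the preprocessor consumes; nothing here is Cell-specific. Run with at least two ranks (e.g. mpirun -np 2).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, token = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                    /* one basic task ...        */
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {             /* ... sends data to another */
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", token);
        }
        MPI_Finalize();
        return 0;
    }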
OpenMP
 Based on pragmas
 Enables code reuse
 Programmer provides:
   The OpenMP application (a sketch follows)
   Core-level optimizations
   DMA optimizations
 Mapping and scheduling: automated
   Most of the work is on the compiler side
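
A minimal sketch of what the programmer writes: ordinary C plus a pragma. saxpy is an illustrative example, not from the deck; on the Cell, the mapping and DMA are left to the compiler and runtime.

    void saxpy(int n, float a, const float *x, float *y) {
        #pragma omp parallel for           /* compiler/runtime spreads the   */
        for (int i = 0; i < n; i++)        /* iterations across the cores    */
            y[i] = a * x[i] + y[i];
    }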
Cell SuperScalar (CellSS)
 Very good for quickly porting applications to the Cell/B.E.
 Programmer provides:
   A sequential C application
   Pragmas that mark the functions to be offloaded (see the sketch below)
   Additional data-distribution information
 Based on a compiler and a run-time system
   The compiler splits the annotated application into a PPE application and an SPE application
   The runtime system maintains a dynamic data-dependency graph of the active tasks, updating it each time a task starts/ends
 Dynamic scheduling
   Based on the runtime computation of the data-dependency graph
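
A sketch in the CellSs annotation style, as described in the BSC papers; the exact pragma syntax may differ between releases. The input/inout clauses are what the runtime uses to build the data-dependency graph.

    #define B 64    /* block size (illustrative) */

    #pragma css task input(a, b) inout(c)
    void block_mmul(float a[B][B], float b[B][B], float c[B][B]) {
        for (int i = 0; i < B; i++)
            for (int j = 0; j < B; j++)
                for (int k = 0; k < B; k++)
                    c[i][j] += a[i][k] * b[k][j];   /* runs on an SPE */
    }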
Sequoia
 A high-level abstract model, suitable for divide-and-conquer applications
   Uses memory locality as the primary parallelization criterion
   Application = a hierarchy of parameterized, recursively decomposed tasks
   Tasks run in isolation (data locality)
 Programmer provides:
   The application's hierarchical graph
   A mapping of the graph onto the platform
   (Optimized) code for the leaf nodes
 A flexible environment for tuning and testing application performance
SP@CE
 Dedicated to streaming applications
   An application is a collection of kernels that communicate only by data streaming
 Programmer provides:
   The application streaming graph (XML)
   (A library of) optimized kernels for the SPEs
 Dynamic scheduling, based on a centralized job queue
 A run-time system on the SPEs to reduce (some of the) communication overhead
Charm++ and the Offload API
 An application = a collection of chares
   Chares communicate through messages
   Created and/or destroyed at runtime
 A chare has a list of work requests to run on the SPEs
   PPE: uses the Offload API to manage the work requests (data flow, execution, completion)
   SPE: a small runtime system for local management and optimizations
 Programmer provides:
   The Charm++ application
   The work requests and their SPE code
RapidMind
 Based on “SPMD streaming”
   Tasks are executed on parallelized streams of data
 A kernel (a “program”) is a computation on the elements of a vector
 An application is a combination of regular code and RapidMind code => the compiler translates it into PPE code and SPE code
 Programmer provides:
   A C++ application
   The computation kernels inside the application
 Kernels can execute asynchronously => task parallelism
MultiCore Framework SDK (Mercury)
 A master-worker model
   Focused on data parallelism and data distributions
 An application = a manager (on the PPE) and workers (on the SPEs)
 Data communication is based on:
   Virtual channels between the manager and the worker(s)
   Data objects that specify data granularity and distribution
   The elements read/written can differ at the two channel ends
 Programmer provides:
   C code for the kernels
   The channel interconnections, via read/write operations
   A data-distribution object for each channel
 No parallelization support, no core optimizations, no application-level design
Brief overview
Features - revisited
How to compare performance?
 Implement one application from scratch
   Impractical and very time-consuming
 Use an already available benchmark
   Matrix multiplication is available (a reference kernel follows)
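
For reference, the naive C kernel that every model's implementation has to reproduce; sizes and names are illustrative.

    void mmul(int n, const float *a, const float *b, float *c) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                float acc = 0.0f;
                for (int k = 0; k < n; k++)
                    acc += a[i * n + k] * b[k * n + j];
                c[i * n + j] = acc;          /* row-major C = A * B */
            }
    }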
Performance
 See examples …
Are the results relevant?
 Only partially!
   MMUL is NOT a good benchmark for high-level programming models
   The results mostly reveal the success of the low-level optimizations
 The implementations are VERY different
   Hard to measure computation only
   Data-distribution issues are addressed very differently
 Overall, a better approach to performance comparison is needed:
   A benchmark application
   A set of metrics
Still …
 Low-level optimizations are not part of a programming model's targets => they can/should be designed separately and heavily reused
 The performance overhead induced by designing and/or implementing in a high-level model decreases with the size of the application
 The programming effort spent on SPE optimizations increases the overall implementation effort by a constant factor, independent of the chosen programming model
Usability
The Answers [1/2]
 High-level programming models cover enough features to support application design and implementation at all levels
 Low-level optimizations and high-level algorithm parallelization remain difficult tasks for the programmer
 No single Cell/B.E. programming model can address all application types
 Feature coverage per layer: High-level > 90%, Mapping 0-100%, Core-level > 50%
The Answers [2/2]
 Alleviate the programmability issue: 60%
 Preserve the high Cell/B.E. performance: 90%
 Are easy to use? 10-90%
 Allow for automation? 50%
 Is there an ideal one? NO
GPU Models [1/2]
 GPGPU used to be fancy
   OpenGL
   Cg
   RapidMind
GPU Models [2/2]
 NVIDIA GPUs
   CUDA is an original HW-SW co-design approach
   Extremely popular
   Considered easy to use
 ATI/AMD GPUs
   Originally Brook
   Currently the ATI Stream SDK
OpenCL [1/4]
 Currently up and running for:
   AMD/ATI, IBM, NVIDIA, Apple
 Other members of the Khronos consortium to follow
   ARM, Intel [?]
 See examples …
OpenCL [2/4]
 Language specification
   C-based cross-platform programming interface
   Subset of ISO C99 with language extensions, familiar to developers
   Online or offline compilation and build of compute-kernel executables
 Platform-layer API
   A hardware-abstraction layer over diverse computational resources
   Query, select, and initialize compute devices
   Create compute contexts and work queues
 Runtime API
   Execute compute kernels
   Manage scheduling, compute, and memory resources (a host-side sketch of these steps follows)
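
A minimal host-side sketch of these three pieces, using the standard OpenCL C API; error checks are elided, and "scale" is a hypothetical kernel shown on the next slide.

    #include <CL/cl.h>

    void run(const char *src, size_t n) {          /* n: number of work-items */
        cl_platform_id plat;  cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);          /* platform layer: query   */
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);  /* online compilation */
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    n * sizeof(float), NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        size_t local = 64;                 /* work-group size; n: a multiple */
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, &local, 0, NULL, NULL);
        clFinish(q);                       /* runtime: wait for completion   */
    }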
OpenCL [3/4] – memory model
 A multi-level memory model
   Private memory is visible only to an individual work-item
   Global memory is visible to all compute units on the device
   Depending on the HW, memory spaces can be collapsed together
 Four memory spaces (illustrated in the kernel sketch below)
   Private memory: a single work-item (think registers)
   Local memory: shared by the work-items in a work-group
   Constant memory: stores constant data for read-only access
   Global memory: used by all the compute units on the device
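
An OpenCL C sketch touching all four spaces; "scale" is the hypothetical kernel built by the host code above, and the tile size matches the work-group size of 64 chosen there.

    __constant float FACTOR = 2.0f;              /* constant memory (read-only)  */

    __kernel void scale(__global float *data) {  /* global memory                */
        __local float tile[64];                  /* local: shared per work-group */
        float x = data[get_global_id(0)];        /* x lives in private memory    */
        tile[get_local_id(0)] = x;
        barrier(CLK_LOCAL_MEM_FENCE);            /* work-group synchronization   */
        data[get_global_id(0)] = FACTOR * tile[get_local_id(0)];
    }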
OpenCL [4/4] – execution model
 Execution model
   Compute kernels can be thought of as either data-parallel (well-matched to GPUs) or task-parallel (well-matched to the architecture of CPUs)
   A compute kernel is the basic unit of executable code and can be thought of as similar to a C function
   Kernel execution can be in-order or out-of-order
   Events let the developer check the status of runtime requests
 The execution domain of a kernel
   An N-dimensional computation domain
   Each element in the execution domain is a work-item
   Work-items can be clustered into work-groups for synchronization and communication (a 2-D sketch follows)
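
A sketch of a 2-D execution domain: each work-item computes one grid element. add2d and its arguments are illustrative; the host would enqueue it with work_dim = 2 and a global size of {width, height}.

    __kernel void add2d(__global const float *a, __global const float *b,
                        __global float *c, int width) {
        size_t x = get_global_id(0);         /* work-item index, dimension 0 */
        size_t y = get_global_id(1);         /* work-item index, dimension 1 */
        c[y * width + x] = a[y * width + x] + b[y * width + x];
    }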
Conclusions
 A multitude of programming models
   Abundant for the Cell/B.E., due to the original lack of high-level programming support
   Fewer for GPUs, due to CUDA
 Simple programming models are key to platform adoption
   CUDA
 Essential features are:
   Tackling *all* parallelism layers of a platform
     Both automagically and with user intervention
   Portability
   Ease of use, i.e. not a very steep learning curve (C-based works)
   (Control over) performance
     Most of the time, efficiency
Take-home messages
 Application parallelization remains the programmer's task
   Programming models should facilitate quick implementation and evaluation
 Programming models are hard to compare
   They are application-specific or platform-specific
   Often user-specific, too
 Low portability is considered worse than performance drops
   The performance trade-offs are smaller than expected
   OpenCL's portability is responsible (so far) for its appeal
Thank you!
 Questions?
[email protected]
[email protected]
http://www.pds.ewi.tudelft.nl/~varbanescu