Project F2: Update 10/01/2007


System Coordination Library (SCL) Framework
Vikas Aggarwal
Rafael Garcia
Abraham Sanchez
Philips Shih
Challenges & Problems

FPGAs and other devices (e.g. Cell and GPUs) are gaining popularity as accelerators
Development support for large-scale applications is lacking
Device design languages for FPGAs are migrating towards true high-level languages (HLL)
Missing piece: System-level Coordination Library, extension to HLL
Complete lack of inter-operability; several IDEs and devices are gaining popularity in smaller domains
Lack of direct coordination amongst devices precludes their usage as peers in massively parallel machines
Standardization of communication and compatibility amongst different devices is highly desirable to capture a larger user base
Lack of transition from Formulation phase to Design phase
Proposed Solution

Design a System Coordination Library to facilitate coordination amongst a heterogeneous set of devices
Provide a familiar coordination/communication interface to parallel program developers; employ MPI-like interfaces
Standardize coordination primitives across different technologies
Provide a higher level of abstraction for communication
Allows applications to be more portable across changing platforms
Provide communication based on relevant communication infrastructure
Life cycles of software are generally longer than the corresponding hardware version
Build communication from the bottom up, employing existing work and effort like MPI, GenAPI, etc.
Provide a transition from Formulation phase to Design phase
Allow parallel programs to be expressed as task graphs
Provide a framework to auto-generate communication infrastructure based on mapping of tasks to different devices
DARPA Study – Quick Glance
[Figure: design-flow layers]
  F (Formulation): strategic design abstraction; prediction, tradeoff analysis
  D (Design): system coordination language; device design languages; library reuse (modules, cores)
  T: Translation
  E: Execution
  Languages shown: VHDL, Gedae, Impulse C, Carte C, AccelDSP, etc.
  Devices shown: FPGA devices (e.g. Stratix-II/III, Virtex-4/5), x86, Cell, …
Device design languages for FPGA devices are migrating upward in abstraction towards true HLL
Missing piece in Design layer is System Coordination Library, extension to HLL
Bigger Picture

Formulation enables abstract modeling of algorithms
  Allows decomposition of apps into constituent tasks
  Allows automated performance prediction for a particular algorithm decomposition
Missing Components
  Several techniques have reaped benefits of automated DSE in conventional computing
  Bridging Formulation and Design phases
  Providing automatically generated framework for communication between tasks
Multi-FPGA applications still present a major development bottleneck
  Automated grouping & mapping of tasks onto resources provide tremendous benefits
  Auto-generation of communication infrastructure using the mapping information
[Figure: example RCML model of a conceptual application, with blocks labeled Distributor, Row FFT, Column FFT, Re-organize data, and Frequency domain processing; the corresponding task graph of the application (tasks t1 to t8); and the suggested mapping of tasks on resources]
Basic Definitions
[Figure: example task graph with FFT, H(f), and IFFT blocks]
Programming Model
  SCL Task: finest unit of computation in SCL
  Task definition code: implements the computational part of a task in a device design language (DDL)
  Task graph: defines the application by describing the tasks and the communication between them
  Mapping: provides mapping of tasks onto devices
Architectural Model
  SCL Device: finest granularity of computational resource that can execute one or more tasks and has a unique address within a platform
  SCL Platform/Node: a set of SCL-compliant devices connected together by some underlying topology into a single uniquely addressable entity in the system
  SCL System: a set of platforms connected together by some underlying topology
  SCL Resource graph: maintains information about all devices and platforms in the system with their interconnection
Co-ordination Using SCL

Intra-device-level coordination: coordination between tasks within a single device
  Two tasks mapped to a single FPGA or two SPEs of a single Cell
Intra-platform-level coordination: coordination between tasks on different devices on a single platform
  Coordination between a Nallatech board and its host processor
System-level coordination: coordination between tasks mapped on different platforms
  A Nallatech board communicating with a PS3 and a Gidel board

SCL Compliance: to support coordination at above levels of hierarchy
A device is SCL compliant if
  It can support communication between multiple tasks mapped onto the same device,
  And provides some mechanism for specifying communication with the platform
A platform is SCL compliant if
  It is composed of SCL compliant devices,
  And can support communication between tasks running on different SCL-compliant devices within the platform,
  And provides some mechanism for specifying communication external to the platform
A system is SCL compliant if
  It is composed of SCL compliant platforms,
  And can support communication between multiple SCL-compliant platforms
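The three coordination levels can be told apart from where two tasks are mapped. The sketch below is illustrative only: `SclAddress`, `CoordinationLevel`, `classify`, and `level_name` are hypothetical names, not part of SCL, and the sketch assumes each device is identified by a non-negative platform number plus a non-negative device number within that platform.

```cpp
#include <cassert>
#include <string>

// Hypothetical address of a device: which platform it sits on, and which
// device it is within that platform.
struct SclAddress {
    int platform;
    int device;
};

enum class CoordinationLevel { IntraDevice, IntraPlatform, System };

// Classify the coordination needed between two tasks from the addresses of
// the devices they are mapped to.
CoordinationLevel classify(const SclAddress& a, const SclAddress& b) {
    // Addresses are assumed to be non-negative in this sketch.
    assert(a.platform >= 0 && a.device >= 0 && b.platform >= 0 && b.device >= 0);
    if (a.platform != b.platform) return CoordinationLevel::System;
    if (a.device != b.device) return CoordinationLevel::IntraPlatform;
    return CoordinationLevel::IntraDevice;
}

// Human-readable name for a coordination level.
std::string level_name(CoordinationLevel level) {
    switch (level) {
        case CoordinationLevel::IntraDevice:   return "intra-device";
        case CoordinationLevel::IntraPlatform: return "intra-platform";
        case CoordinationLevel::System:        return "system";
    }
    return "unknown";
}
```

Under these assumptions, two tasks on the same FPGA classify as intra-device, a board and its host processor as intra-platform, and boards on different platforms as system-level.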
Communication using Hierarchy




Hierarchical addressing
  Each platform has a unique “platform address” in the system
  Each device has a unique “device address” in its platform and hence in the system
  Use of address to build communication structure
[Figure: devices labeled C, F, Cell, and GPU joined by an interconnect]
SCL Resource graph
  Contains knowledge of the SCL compliant resources available in the system in hierarchical manner
  SCL parser will use info. from the graph to find appropriate communication routines
  Communication constructs will be auto-stitched in the task definition code
[Figure: resource graph drawn as a tree, with System at the root, platforms P1 to P5 below it, and devices (D1, D2, D3) as leaves]
Given a task graph of the application and a resource graph for the system,
a mapping of tasks onto devices is required to run the application
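One way to picture the resource graph and the mapping together is a two-level lookup: platform address first, then device address. The sketch below is an assumption-laden illustration, not the SCL data structure: `ResourceGraph`, `TaskMapping`, `has_device`, and `unmapped_tasks` are hypothetical names, and integer addresses are assumed.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical resource graph: platform address -> device addresses on it.
using ResourceGraph = std::map<int, std::set<int>>;
// Hypothetical mapping: task name -> (platform address, device address).
using TaskMapping = std::map<std::string, std::pair<int, int>>;

// True if (platform, device) names a device present in the resource graph.
bool has_device(const ResourceGraph& graph, int platform, int device) {
    // Addresses are assumed to be non-negative in this sketch.
    assert(platform >= 0 && device >= 0);
    auto it = graph.find(platform);
    return it != graph.end() && it->second.count(device) > 0;
}

// Return the tasks whose target device is missing from the resource graph,
// i.e. the tasks that prevent the application from running as mapped.
std::set<std::string> unmapped_tasks(const ResourceGraph& graph,
                                     const TaskMapping& mapping) {
    std::set<std::string> missing;
    for (const auto& entry : mapping) {
        if (!has_device(graph, entry.second.first, entry.second.second)) {
            missing.insert(entry.first);
        }
    }
    return missing;
}
```

An empty result from `unmapped_tasks` would mean every task in the task graph lands on a device the system actually contains.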
Quick Peek: Example
[Figure: task A (Generate random numbers) feeds task B (Process numbers) over edge1. Panel labels: Architecture dependent IDE; Architecture Independent System-level Coordination; process.impulseC]

random.cpp
    SCL_Init( … );
    for (unsigned i = 0; i < 100; i++)
    {
        int x = rand();
        scl_send( "out1", &x, … );      // send x on output port out1
    }

process.handelC
    SCL_Init( … );
    int acc = 0;
    for (unsigned i = 0; i < 100; i++) {
        int temp;
        scl_receive( "in1", &temp, … ); // receive one value from input port in1
        acc += temp;
    }

systemApp.scl
    Edge edge1;
    Task random ( Out out1 )
    {
        edge1 = out1;
    }
    Task process ( In in1 )
    {
        in1 = edge1;
    }
Defines application as a task graph
Defines communication between tasks as edges in the task graph

Tasks to resource Mapping
tasks.map
    Num Micro-tasks : 2
    ...
    --------- Task 1 : random
    Target: x86
    IDE: C++
    Address:
    Library:
    ...
    --------- Task 2 : process
    Target: FPGA
    IDE: Handel-C
    Address:
    ...
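To make the data flow of the example concrete, the sketch below runs both tasks in one process over an in-memory FIFO. It is not the SCL implementation: `Edge`, `PortTable`, `demo_send`, `demo_receive`, and `run_pipeline` are hypothetical names, the elided arguments of the slide's calls are not reproduced, and deterministic values replace `rand()` so the result is predictable.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>

// Hypothetical stand-in for an SCL edge: a FIFO of ints shared by the sending
// and receiving task. A real edge would cross device boundaries.
struct Edge {
    std::deque<int> fifo;
};

// Hypothetical port table derived from the task graph: port name -> edge.
using PortTable = std::map<std::string, Edge*>;

// Illustrative send: append one value to the edge bound to `port`.
void demo_send(PortTable& ports, const std::string& port, int value) {
    ports.at(port)->fifo.push_back(value);
}

// Illustrative receive: remove the oldest value from the edge bound to `port`.
int demo_receive(PortTable& ports, const std::string& port) {
    Edge* edge = ports.at(port);
    assert(!edge->fifo.empty());  // a blocking receive would wait here instead
    int value = edge->fifo.front();
    edge->fifo.pop_front();
    return value;
}

// Mirror of the two tasks: the producer sends `count` values on "out1", then
// the consumer accumulates the same number of values from "in1".
int run_pipeline(int count) {
    Edge edge1;
    PortTable ports{{"out1", &edge1}, {"in1", &edge1}};  // both ports share edge1
    for (int i = 0; i < count; ++i) {
        demo_send(ports, "out1", i);  // deterministic values instead of rand()
    }
    int acc = 0;
    for (int i = 0; i < count; ++i) {
        acc += demo_receive(ports, "in1");
    }
    return acc;
}
```

The point of the sketch is the binding step: both port names resolve to the same edge, which is the role the task graph plays in the slide's example.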
Compilation Process

Step 1: Parse task-graph in “.scl” file
  Gather information about “communication edges” from .scl file
  Definition for “SCL_” functions will be populated with one entry for each edge at a later stage
  In future, could also provide a script to add partially auto-generated functionality for legacy code in existing languages
Step 2: Read “.map” file
  Parser would extract the information from the .map file about the mappings of various tasks
  Definition of “SCL_” functions is auto-generated based on this mapping information
Step 3: Build tasks in their native build environment
  Definition for SCL functions is linked to the definition generated in previous step
Run-time service responsible for spawning tasks (could be a manual process in the beginning)
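Steps 1 and 2 can be pictured as joining two tables: the edges found in the .scl file and the task-to-target mapping from the .map file. The sketch below is a guess at the shape of that join, not the actual generator: `EdgeInfo`, `EdgeEntry`, and `build_edge_table` are hypothetical names, and targets are plain strings such as "x86" or "FPGA" taken from the example mapping.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// One communication edge gathered from the .scl file (step 1).
struct EdgeInfo {
    std::string name;
    std::string producer;  // task that writes to the edge
    std::string consumer;  // task that reads from the edge
};

// One generated table entry per edge, annotated with mapping data (step 2).
struct EdgeEntry {
    std::string name;
    std::string producer_target;
    std::string consumer_target;
};

// Build the per-edge table from the edges and the task -> target mapping.
std::vector<EdgeEntry> build_edge_table(
        const std::vector<EdgeInfo>& edges,
        const std::map<std::string, std::string>& targets) {
    std::vector<EdgeEntry> table;
    for (const EdgeInfo& e : edges) {
        // Every task on an edge must appear in the mapping.
        assert(targets.count(e.producer) == 1 && targets.count(e.consumer) == 1);
        table.push_back({e.name, targets.at(e.producer), targets.at(e.consumer)});
    }
    return table;
}
```

A code generator could then pick a communication routine per entry based on the pair of targets; that selection is left out here because the source does not describe it.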
Basic Co-ordination Primitives




Identify baseline functions to support basic communication in the initial phase
Identify necessary static and run-time parameters
Focus on synchronous blocking communication based on message passing (dominant mode of communication in MPI)
Consider other modes wherever applicable to facilitate efficient data transfer
  Shared memory constructs for data movement within a platform
  Streaming communication model, for systems capable of supporting this mode

Function | Call | Type | Purpose
Initialization | SCL_Init | Setup | Prepares system for communication and synchronization
Send | SCL_Send | P2P | Send data to a matching receive
Receive | SCL_Recv | P2P | Receive data from a matching send
Barrier | SCL_Barrier | Synchronization | Synchronize all the nodes together
Broadcast | SCL_Bcast | Collective | One node sends the same data to multiple nodes. Doesn't need a separate function; can be implemented by having a separate edge type in task graph
Gather | SCL_Gather | Collective | Each node sends a separate block of data to “root” node to provide an all-to-one scheme
… | … | … | …
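The data movement behind the two collectives in the table can be shown without any real transport. The sketch below models only where the data ends up: `demo_bcast` and `demo_gather` are hypothetical helpers, not SCL calls, and gathering in node order is an assumption borrowed from MPI's convention.

```cpp
#include <cassert>
#include <vector>

// Illustrative broadcast: one node's value is copied to every node's slot.
std::vector<int> demo_bcast(int value, int node_count) {
    assert(node_count > 0);
    return std::vector<int>(node_count, value);
}

// Illustrative gather: each node contributes one block, and the root ends up
// with the blocks concatenated in node order (all-to-one).
std::vector<int> demo_gather(const std::vector<std::vector<int>>& blocks) {
    std::vector<int> at_root;
    for (const std::vector<int>& block : blocks) {
        at_root.insert(at_root.end(), block.begin(), block.end());
    }
    return at_root;
}
```

Both helpers return plain vectors so the before and after states are easy to compare.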
Challenges

Mapping from tasks to devices requires static, compile-time behavior
  Number of processes and communication is statically defined at compilation
  Is it over-restrictive? The majority of applications follow a well-behaved structure
Re-compilation required in most cases when the mapping changes or the number of tasks changes; explore ways to minimize such situations
Allow for changing the task graph by changing parameters in .scl file in acceptable cases
  Static task graphs are a well-studied problem
  Provision of loops to accommodate variable number of tasks in the graph
System should allow for post-compile-time scaling on homogeneous nodes
SCL Parser Requirements

Basic grammar to define SCL task graph language
[Figure: grammar rule diagram with the rules SCL_FILE, SCL_CONSTRUCT, ARITH_OP, EDGE_ASSIGNMENT, EXPR, EDGE_DECLARATION, PORT_TYPE, TASK_HEADER, TASK, EDGE_TYPE, LOOP, TASK_DEFINITION, LOOP_EXPR]
Build abstract syntax tree and extract edge & task information
Generate platform-specific code that implements specified communication behavior
SCL Parser Design

[Figure: two-stage tool flow]
SCL Parser
  Reads task graph definition
  Finds all tasks
  Determines communication
Code Generator
  Reads .map file
  Determines resource mapping
  Implements SCL calls in native platform code
Eclipse


Using Eclipse environment to develop the SCL parser
Compatible with other HPCSA tools: RCML, PTP
Allows easier integration with other tools/entry points
  Graphical editing environment
  Easy plug-in based integration
Portable across most operating systems: Windows, Linux, Mac OS X

Eclipse-based framework for developing Domain-Specific Languages (DSL)
DSL: small specialized languages used to raise the abstraction level of software
  Removes extraneous programming details
  Provides for simplified specification
Features
  Allows specification of the grammar, creates a parser
  Generates a complete Eclipse text editor
    Syntax coloring, syntax checking / error markers
    Code completion
    Navigation, folding
    Outline, find references
SCL Environment
[Figure: screenshot of the SCL environment showing the text editor, project files, outline view, and console]
Graphviz

Converts textual descriptions of graphs into diagrams
    digraph edge_map {
        P1 -> C1 [ label = "E1" ];
        P2 -> P1 [ label = "E2" ];
        G1 -> P1 [ label = "E3" ];
    }
Aids in design and verification of task graphs
Textual description is automatically derived from user’s design and converted into Graphviz language
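The conversion into Graphviz language amounts to printing one line per edge of the task graph. The sketch below shows one plausible way to do it; `GraphEdge` and `to_dot` are hypothetical names, and the two-space indentation of the output is a choice made here, not taken from the tool.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// One labeled edge of the task graph: source task, destination task, edge name.
struct GraphEdge {
    std::string from;
    std::string to;
    std::string label;
};

// Emit a Graphviz digraph with one "from -> to [ label = ... ]" line per edge.
std::string to_dot(const std::string& graph_name,
                   const std::vector<GraphEdge>& edges) {
    assert(!graph_name.empty());
    std::ostringstream out;
    out << "digraph " << graph_name << " {\n";
    for (const GraphEdge& e : edges) {
        out << "  " << e.from << " -> " << e.to
            << " [ label = \"" << e.label << "\" ];\n";
    }
    out << "}\n";
    return out.str();
}
```

Feeding the returned text to the `dot` command-line tool would render the diagram used for verification.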
Simple SCL example

Installation
  Download self-extracting SCL plugin and extract into Eclipse plug-in directory
Project setup
  Open Eclipse->File->New Project->Xtext DSL Wizards->SCL Project
Project specification
  Describe SCL task graph in the model.scl file
  Create and specify model.map file
Task graph parse & code generation
  Run the .oaw file
Verification
  View Graphviz diagram and verify proper task graph description
Compilation & Execution
  Compile task definition code & execute application
Proof of Concept – Building First App

Initial emphasis: SCL coordinating computing on two different platforms selected from a heterogeneous suite (FPGA, CPU, GPU, etc.)
  Feature FPGA as superior device technology
  Multi-FPGA platform: Gidel board with a host CPU
Development environments
  Impulse C, VHDL for FPGA
  C++ for processors
Applications
  Target tracking application using multi-FPGA design
Target tracking – Task Graph
[Figure: target tracking task graph with tasks C1, F1, F2, F3, F4 and edges CF1, CF2, BE1, E1, E2, E3]

C1
    edge CF1, CF2 ;
    task C1 ( output out1, input in1 )
    {
        in1 = CF2 ;
        CF1 = out1 ;
    }

F1
    edge E1 ;
    bedge BE1 ;
    task F1 ( output out1, output out2,
              input in1, input in2 )
    {
        in1 = CF1 ;
        in2 = E1 ;
        CF2 = out1 ;
        BE1 = out2 ;
    }

F2/F3
    edge E2, E3 ;
    taskId t[2] ;
    loop(i=2; i<=3; i++)
    {
        t[$i] = $i ;
        task F$i( output out1, input in1, input in2 )
        {
            in1 = BE1 ;
            in2 = E$i ;
            E$(i-1) = out1 ;
        }
    }
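The loop construct reads as a template that is stamped out once per index value. The sketch below shows that reading under an assumption: that `$i` and `$(i-1)` are replaced textually by the current index and its predecessor. The real parser's semantics may differ, and `replace_all` and `expand_loop` are hypothetical helpers.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Replace every occurrence of `pattern` in `text` with `replacement`.
std::string replace_all(std::string text, const std::string& pattern,
                        const std::string& replacement) {
    assert(!pattern.empty());
    std::string::size_type pos = 0;
    while ((pos = text.find(pattern, pos)) != std::string::npos) {
        text.replace(pos, pattern.size(), replacement);
        pos += replacement.size();  // continue after the inserted text
    }
    return text;
}

// Expand a loop body once per index in [first, last], substituting the two
// placeholder forms used in the task graph: "$(i-1)" and "$i".
std::vector<std::string> expand_loop(const std::string& body, int first, int last) {
    std::vector<std::string> expanded;
    for (int i = first; i <= last; ++i) {
        std::string text = replace_all(body, "$(i-1)", std::to_string(i - 1));
        text = replace_all(text, "$i", std::to_string(i));
        expanded.push_back(text);
    }
    return expanded;
}
```

With bounds 2 and 3, the body of the loop above would yield one definition for F2 (reading E2, writing E1) and one for F3 (reading E3, writing E2).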