Project F2: Update 10/01/2007
System Coordination Library (SCL)
Framework
Vikas Aggarwal
Rafael Garcia
Abraham Sanchez
Philips Shih
Challenges & Problems
FPGAs and other devices (e.g. Cell & GPUs) gaining popularity as accelerators
Development support for large-scale applications is lacking
Device design languages for FPGAs are migrating towards true HLL
Missing piece: System-level Coordination Library, extension to HLL
Complete lack of interoperability; several IDEs and devices gaining popularity in smaller domains
Lack of direct coordination amongst devices precludes usage as peers in massively parallel machines
Standardization of communication and compatibility amongst different devices is highly desirable to capture a larger user base
Lack of transition from Formulation phase to Design phase
Proposed Solution
Design a System Coordination Library to facilitate coordination amongst a heterogeneous set of devices
Provide a familiar coordination/communication interface to parallel program developers; employ MPI-like interfaces
Standardize coordination primitives across different technologies
Provide a higher level of abstraction for communication
Allows applications to be more portable across changing platforms
Life cycles of software are generally longer than those of the corresponding hardware
Provide communication based on relevant communication infrastructure
Build communication from the bottom up, employing existing work and effort like MPI, GenAPI, etc.
Provide a transition from Formulation phase to Design phase
Allow parallel programs to be expressed as task graphs
Provide a framework to auto-generate communication infrastructure based on mapping of tasks to different devices
DARPA Study – Quick Glance
[Figure: layered model with Formulation (F), Design (D), Translation (T), and Execution (E)]
Formulation: strategic design abstraction
Formulation: prediction, tradeoff analysis
Design: system coordination language
Design: device design languages
Design: library reuse (modules, cores)
Languages and tools shown: VHDL, Gedae, Impulse C, Carte C, AccelDSP, etc.
Targets shown: FPGA devices (e.g. Stratix-II/III, Virtex-4/5), x86, Cell, etc.
Device design languages for FPGA devices are migrating upward in abstraction towards true HLL
Missing piece in the Design layer is a System Coordination Library, an extension to HLL
Bigger Picture
Formulation enables abstract modeling of algorithms
Allows decomposition of apps into constituent tasks
Allows automated performance prediction for a particular algorithm decomposition
Missing Components
Several techniques have reaped benefits of automated DSE in conventional computing
Bridging Formulation and Design phases
Providing automatically generated framework for communication between tasks
Multi-FPGA applications still present a major development bottleneck
Automated grouping & mapping of tasks onto resources provide tremendous benefits
Auto-generation of communication infrastructure using the mapping information
[Figure: example RCML model of a conceptual application (Distributor, Row FFT, Re-organize data, Column FFT, Frequency domain processing), the corresponding task graph of the application (tasks t1 to t8), and the suggested mapping of tasks on resources]
Basic Definitions
[Figure: example task graph with FFT, H(f), and IFFT blocks]
Programming Model
SCL Task: finest unit of computation in SCL
Task definition code: implements the computational part of a task in a device design language (DDL)
Task graph: defines the tasks and the communication between them
Mapping: provides mapping of tasks onto devices
Architectural Model
SCL Device: finest granularity of computational resource that can execute one or more tasks and has a unique address within a platform
SCL Platform/Node: a set of SCL-compliant devices connected together by some underlying topology into a single uniquely addressable entity in the system
SCL System: a set of platforms connected together by some underlying topology
SCL Resource graph: maintains information about all devices and platforms in the system with their interconnection
Co-ordination Using SCL
Intra-device-level coordination: coordination between tasks within a single device
Example: two tasks mapped to a single FPGA or two SPEs of a single Cell
Intra-platform-level coordination: coordination between tasks on different devices on a single platform
Example: coordination between a Nallatech board and its host processor
System-level coordination: coordination between tasks mapped on different platforms
Example: a Nallatech board communicating with a PS3 and a Gidel board
SCL Compliance: to support coordination at the above levels of hierarchy
A device is SCL compliant if
It can support communication between multiple tasks mapped onto the same device,
And provides some mechanism for specifying communication with the platform
A platform is SCL compliant if
It is composed of SCL compliant devices,
And can support communication between tasks running on different SCL-compliant devices within the platform,
And provides some mechanism for specifying communication external to the platform
A system is SCL compliant if
It is composed of SCL compliant platforms,
And can support communication between multiple SCL-compliant platforms
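The three coordination levels above can be told apart from where two tasks are mapped. The following is a minimal sketch, assuming each device is identified by a (platform, device) address pair. The names `SclAddress`, `CoordinationLevel`, and `classify` are hypothetical and are not part of SCL.

```cpp
#include <cassert>
#include <string>

// Hypothetical address: a platform address that is unique in the system plus
// a device address that is unique within that platform.
struct SclAddress {
    int platform;
    int device;
};

enum class CoordinationLevel { IntraDevice, IntraPlatform, System };

// Classify the coordination needed between two tasks from the addresses of
// the devices they are mapped to.
CoordinationLevel classify(const SclAddress& a, const SclAddress& b) {
    // Assumption of this sketch: addresses are non-negative integers.
    assert(a.platform >= 0 && a.device >= 0 && b.platform >= 0 && b.device >= 0);
    if (a.platform != b.platform) return CoordinationLevel::System;
    if (a.device != b.device) return CoordinationLevel::IntraPlatform;
    return CoordinationLevel::IntraDevice;
}

std::string to_string(CoordinationLevel level) {
    switch (level) {
        case CoordinationLevel::IntraDevice:   return "intra-device";
        case CoordinationLevel::IntraPlatform: return "intra-platform";
        case CoordinationLevel::System:        return "system";
    }
    return "unknown";
}
```

Under this sketch, two tasks on the same FPGA classify as intra-device, a board and its host processor as intra-platform, and devices on different platforms as system-level.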
Communication using Hierarchy
Hierarchical addressing
Each platform has a unique “platform address” in the system
Each device has a unique “device address” in its platform and hence in the system
Use of address to build communication structure
SCL Resource graph
Contains knowledge of the SCL compliant resources available in the system in hierarchical manner
SCL parser will use info. from the graph to find appropriate communication routines
Communication constructs will be auto-stitched in the task definition code
[Figure: devices (C, F, Cell, GPU) joined by an interconnect, and an SCL resource graph showing a System with platforms P1 to P5 and their devices (labelled D1, D2, D3)]
Given a task graph of the application and a resource graph for the system, a mapping of tasks onto devices is required to run the application
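One way to picture the requirement above is a check that every task is mapped to a device the resource graph actually contains. This is a minimal sketch under stated assumptions: `ResourceGraph`, `DeviceAddress`, and `unmapped_tasks` are hypothetical names, addresses are plain integers, and interconnect information is left out.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// (platform address, device address), e.g. {2, 1} for device D1 on platform P2.
using DeviceAddress = std::pair<int, int>;

// Hypothetical resource graph: the set of device addresses per platform.
struct ResourceGraph {
    std::map<int, std::set<int>> devices_by_platform;

    void add_device(int platform, int device) {
        assert(platform >= 0 && device >= 0);  // assumption: non-negative addresses
        devices_by_platform[platform].insert(device);
    }

    bool contains(const DeviceAddress& addr) const {
        auto it = devices_by_platform.find(addr.first);
        return it != devices_by_platform.end() && it->second.count(addr.second) > 0;
    }
};

// Return the names of tasks that are unmapped or mapped to a device missing
// from the resource graph. An empty result means the mapping is usable.
std::vector<std::string> unmapped_tasks(
        const std::vector<std::string>& tasks,
        const std::map<std::string, DeviceAddress>& mapping,
        const ResourceGraph& resources) {
    std::vector<std::string> problems;
    for (const std::string& task : tasks) {
        auto it = mapping.find(task);
        if (it == mapping.end() || !resources.contains(it->second)) {
            problems.push_back(task);
        }
    }
    return problems;
}
```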
Quick Peek: Example
[Figure: task A generates random numbers and a second task processes them; task code is written in an architecture-dependent IDE, while system-level coordination is architecture independent]
random.cpp
    SCL_Init( … );
    for (unsigned i=0; i < 100; i++)
    {
        int x = rand();
        scl_send( "out1", &x, … );
    }
process.handelC
    SCL_Init( … );
    int acc=0;
    for (unsigned i=0; i < 100; i++) {
        int temp;
        scl_receive( "in1", &temp, … );
        acc += temp;
    }
systemApp.scl
    Edge edge1;
    Task random ( Out out1 )
    {
        edge1 = out1;
    }
    Task process ( In in1 )
    {
        in1 = edge1;
    }
Defines the application as a task graph
Defines communication between tasks as edges in the task graph
[Figure labels, continued: process.impulseC, edge1, B]
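The data flow of the example above can be imitated in a single process to see what the two loops exchange. This is only a sketch: `MockEdges`, `bind`, `send`, `receive`, and `run_example` are hypothetical stand-ins, not the SCL calls, the edge is a plain FIFO of integers, and the loop index replaces rand() so the result is reproducible.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <string>

// Hypothetical in-process stand-in for SCL edges: one FIFO of ints per edge.
class MockEdges {
public:
    // Bind a task-local port name (e.g. "out1") to an edge (e.g. "edge1").
    void bind(const std::string& port, const std::string& edge) { port_to_edge_[port] = edge; }

    void send(const std::string& port, int value) { queues_[edge_of(port)].push(value); }

    // Returns false when nothing is queued; a blocking receive would wait instead.
    bool receive(const std::string& port, int& value) {
        std::queue<int>& q = queues_[edge_of(port)];
        if (q.empty()) return false;
        value = q.front();
        q.pop();
        return true;
    }

private:
    const std::string& edge_of(const std::string& port) const {
        auto it = port_to_edge_.find(port);
        assert(it != port_to_edge_.end() && "port must be bound to an edge first");
        return it->second;
    }
    std::map<std::string, std::string> port_to_edge_;
    std::map<std::string, std::queue<int>> queues_;
};

// Runs the producer loop, then the consumer loop, and returns the accumulator.
int run_example() {
    MockEdges edges;
    edges.bind("out1", "edge1");  // mirrors: Task random ( Out out1 ) { edge1 = out1; }
    edges.bind("in1", "edge1");   // mirrors: Task process ( In in1 ) { in1 = edge1; }

    for (unsigned i = 0; i < 100; i++) edges.send("out1", static_cast<int>(i));

    int acc = 0;
    for (unsigned i = 0; i < 100; i++) {
        int temp = 0;
        if (edges.receive("in1", temp)) acc += temp;
    }
    return acc;
}
```

The point of the sketch is that both tasks name only their own ports; which edge a port is attached to is decided outside the task code, as in systemApp.scl.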
Tasks to resource Mapping
tasks.map
Num Micro-tasks : 2
...
---------Task 1 : random
Target: x86
IDE: C++
Address:
Library:
...
----------Task 2 : process
Target: FPGA
IDE: Handel-C
Address:
...
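A reader for a file laid out like the tasks.map excerpt above might look like the following sketch. The real file format is not specified here, so this handles only the line shapes shown on the slide (a dashed "Task N : name" header followed by "Key: value" lines). `TaskMapping`, `trim`, and `parse_map` are hypothetical names.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct TaskMapping {
    std::string name;
    std::map<std::string, std::string> fields;  // e.g. "Target" -> "x86"
};

// Strip leading and trailing spaces, tabs, and carriage returns.
std::string trim(const std::string& s) {
    const std::size_t first = s.find_first_not_of(" \t\r");
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(" \t\r");
    return s.substr(first, last - first + 1);
}

// Parse the assumed layout: lines without a colon are skipped, a key that
// starts with '-' opens a new task, and any other key is stored on the
// most recently opened task.
std::vector<TaskMapping> parse_map(const std::string& text) {
    std::vector<TaskMapping> tasks;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;      // "..." and blank lines
        const std::string key = trim(line.substr(0, colon));
        const std::string value = trim(line.substr(colon + 1));
        if (!key.empty() && key[0] == '-') {            // "-----Task 1 : random"
            tasks.push_back(TaskMapping{value, {}});
        } else if (!tasks.empty()) {                    // "Target: x86"
            tasks.back().fields[key] = value;
        }
    }
    return tasks;
}
```

Lines that appear before the first task header, such as the task count, are ignored in this sketch.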
Compilation Process
Step 1: Parse task graph in “.scl” file
Gather information about “communication edges” from .scl file
Definition for “SCL_” functions will be populated with one entry for each edge at a later stage
Step 2: Read “.map” file
Parser extracts the information from the .map file about the mappings of various tasks
Definition of “SCL_” functions is auto-generated based on this mapping information
Step 3: Build tasks in their native build environment
Definition for SCL functions is linked to the definition generated in previous step
In future, could also provide a script to add partially auto-generated functionality for legacy code in existing languages
Run-time service responsible for spawning tasks (could be a manual process in the beginning)
Basic Co-ordination Primitives
Identify baseline functions to support basic communication in the initial phase
Identify necessary static and run-time parameters
Focus on synchronous blocking communication based on message passing (dominant mode of communication in MPI)
Consider other modes wherever applicable to facilitate efficient data transfer
Shared memory constructs for data movement within a platform
Streaming communication model – for systems capable of supporting this mode
Function        Call         Type             Purpose
Initialization  SCL_Init     Setup            Prepares system for communication and synchronization
Send            SCL_Send     P2P              Send data to a matching receive
Receive         SCL_Recv     P2P              Receive data from a matching send
Barrier         SCL_Barrier  Synchronization  Synchronize all the nodes together
Broadcast       SCL_Bcast    Collective       One node sends the same data to multiple nodes. Does not need a separate function; can be implemented by having a separate edge type in the task graph
Gather          SCL_Gather   Collective       Each node sends a separate block of data to the “root” node to provide an all-to-one scheme
…               …            …                …
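The data movement of the two collectives in the table can be shown with plain vectors. This is a sketch of the semantics only, not of any SCL implementation: `simulate_gather` and `simulate_bcast` are hypothetical helpers, and placing the gathered blocks in node order at the root is an assumption borrowed from how MPI gathers behave.

```cpp
#include <cstddef>
#include <vector>

// Simulated all-to-one gather: blocks[i] is the data contributed by node i.
// The root ends up with the blocks concatenated in node order (assumed).
std::vector<int> simulate_gather(const std::vector<std::vector<int>>& blocks) {
    std::vector<int> at_root;
    for (const std::vector<int>& block : blocks) {
        at_root.insert(at_root.end(), block.begin(), block.end());
    }
    return at_root;
}

// Simulated broadcast: each of node_count nodes receives its own copy of data.
std::vector<std::vector<int>> simulate_bcast(const std::vector<int>& data,
                                             std::size_t node_count) {
    return std::vector<std::vector<int>>(node_count, data);
}
```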
Challenges
Mapping from tasks to devices requires static, compile-time behavior
Number of processes and communication is statically defined at compilation
Is it over-restrictive? The majority of applications follow a well-behaved structure
Re-compilation required in most cases when mapping changes or number of tasks changes; explore ways to minimize such situations
Allow for changing the task graph by changing parameters in .scl file in acceptable cases
Static task graphs are a well-studied problem
Provision of loops to accommodate variable number of tasks in the graph
System should allow for post-compile-time scaling on homogeneous nodes
SCL Parser Requirements
Basic grammar to define SCL task graph language
Grammar elements: SCL_FILE, SCL_CONSTRUCT, ARITH_OP, EDGE_ASSIGNMENT, EXPR, EDGE_DECLARATION, PORT_TYPE, TASK_HEADER, TASK, EDGE_TYPE, LOOP, TASK_DEFINITION, LOOP_EXPR
Build abstract syntax tree and extract edge & task information
Generate platform-specific code that implements specified communication behavior
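As a rough idea of the edge-extraction step, the sketch below tokenizes SCL-like text and collects the names in edge declarations. It is not the project's parser and builds no syntax tree: it assumes declarations of the form `edge A, B ;` with a lowercase keyword, and `tokenize` and `declared_edges` are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Split text into identifier tokens and single-character punctuation tokens.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        const unsigned char c = static_cast<unsigned char>(text[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isalnum(c) || c == '_') {
            const std::size_t start = i;
            while (i < text.size() &&
                   (std::isalnum(static_cast<unsigned char>(text[i])) || text[i] == '_')) {
                ++i;
            }
            tokens.push_back(text.substr(start, i - start));
        } else {
            tokens.push_back(std::string(1, text[i]));
            ++i;
        }
    }
    return tokens;
}

// Collect the names declared by statements of the assumed form `edge A, B ;`.
std::vector<std::string> declared_edges(const std::string& text) {
    std::vector<std::string> edges;
    const std::vector<std::string> tokens = tokenize(text);
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        if (tokens[i] != "edge") continue;
        // Consume names up to the terminating semicolon, skipping commas.
        for (++i; i < tokens.size() && tokens[i] != ";"; ++i) {
            if (tokens[i] != ",") edges.push_back(tokens[i]);
        }
    }
    return edges;
}
```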
SCL Parser Design
SCL Parser: reads task graph definition, finds all tasks, determines communication
Code Generator: reads .map file, determines resource mapping, implements SCL calls in native platform code
Eclipse
Using Eclipse environment to develop the SCL parser
Compatible with other HPCSA tools (RCML, PTP)
Allows easier integration with other tools/entry points
Portable across most operating systems (Windows, Linux, Mac OS X)
Graphical editing environment
Easy plug-in based integration
Eclipse-based framework for developing Domain-Specific Languages (DSL)
DSL: small specialized languages used to raise the abstraction level of software
Removes extraneous programming details
Provides for simplified specification
Features
Allows specification of the grammar, creates a parser
Generates a complete Eclipse text editor
Syntax coloring, Syntax checking / Error markers
Code completion
Navigation, Folding
Outline, Find References
SCL Environment
[Screenshot labels: Text Editor, Project Files, Outline view, Console]
Graphviz
Converts textual descriptions of graphs into diagrams
    digraph edge_map {
        P1 -> C1 [ label = "E1" ];
        P2 -> P1 [ label = "E2" ];
        G1 -> P1 [ label = "E3" ];
    }
Aids in design and verification of task graphs
Textual description is automatically derived from user’s design and converted into Graphviz language
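Deriving that textual description from a list of edges is a small string-building job. The sketch below emits text in the same shape as the edge_map example; `Edge` and `to_dot` are hypothetical names, and node names are assumed to be plain alphanumeric identifiers that need no quoting.

```cpp
#include <sstream>
#include <string>
#include <vector>

struct Edge {
    std::string from;
    std::string to;
    std::string label;
};

// Emit a Graphviz digraph with one labelled arrow per edge.
std::string to_dot(const std::string& graph_name, const std::vector<Edge>& edges) {
    std::ostringstream out;
    out << "digraph " << graph_name << " {\n";
    for (const Edge& e : edges) {
        out << "  " << e.from << " -> " << e.to
            << " [ label = \"" << e.label << "\" ];\n";
    }
    out << "}\n";
    return out.str();
}
```

Passing the three edges of the example (P1 to C1 as E1, P2 to P1 as E2, G1 to P1 as E3) with the name edge_map reproduces the listing above, apart from indentation.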
Simple SCL example
Installation
Download self-extracting SCL plugin and extract into Eclipse plug-in directory
Project setup
Open Eclipse->File->New Project->Xtext DSL Wizards->SCL Project
Project specification
Describe SCL task graph in the model.scl file
Create and specify model.map file
Task graph parse & code generation
Run the .oaw file
Verification
View Graphviz diagram and verify proper task graph description
Compilation & Execution
Compile task definition code & execute application
Proof of Concept – Building First App
Initial emphasis: SCL coordinating computing on two different platforms selected from heterogeneous suite (FPGA, CPU, GPU, etc.)
Feature FPGA as superior device technology
Development environments
Impulse C, VHDL – for FPGA
C++ – for processors
Multi-FPGA platform
Gidel board with a host CPU
Applications
Target tracking application using multi-FPGA design
Target tracking – Task Graph
[Figure: target tracking task graph with tasks C1, F1, F2, F3, and F4 and edges CF1, CF2, BE1, E1, E2, and E3]
C1
    edge CF1, CF2 ;
    task C1 ( output out1, input in1 )
    {
        in1 = CF2 ;
        CF1 = out1 ;
    }
F1
    edge E1 ;
    bedge BE1 ;
    task F1 ( output out1, output out2,
              input in1, input in2 )
    {
        in1 = CF1 ;
        in2 = E1 ;
        CF2 = out1 ;
        BE1 = out2 ;
    }
F2/F3
    edge E2, E3 ;
    taskId t[2] ;
    loop(i=2; i<=3; i++)
    {
        t[$i] = $i ;
        task F$i( output out1, input in1, input in2 )
        {
            in1 = BE1 ;
            in2 = E$i ;
            E$(i-1) = out1 ;
        }
    }
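To see what the loop in the F2/F3 listing stands for, the sketch below expands a loop body by textual substitution. It assumes `$i` and `$(i-1)` are replaced by the value of the loop index and of the index minus one, which is an interpretation of the listing rather than a documented rule. `replace_all` and `expand_loop` are hypothetical helpers that handle only these two index forms.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Replace every occurrence of `needle` in `text` with `replacement`.
std::string replace_all(std::string text, const std::string& needle,
                        const std::string& replacement) {
    assert(!needle.empty());
    std::size_t pos = 0;
    while ((pos = text.find(needle, pos)) != std::string::npos) {
        text.replace(pos, needle.size(), replacement);
        pos += replacement.size();  // continue after the inserted text
    }
    return text;
}

// Expand a loop body once per index value in [first, last].
std::vector<std::string> expand_loop(const std::string& body, int first, int last) {
    std::vector<std::string> expanded;
    for (int i = first; i <= last; ++i) {
        std::string copy = replace_all(body, "$(i-1)", std::to_string(i - 1));
        copy = replace_all(copy, "$i", std::to_string(i));
        expanded.push_back(copy);
    }
    return expanded;
}
```

Expanding the task body for i = 2 and i = 3 under this assumption gives task F2 reading E2 and writing E1, and task F3 reading E3 and writing E2, so the tasks form a chain that ends at F1.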