Transcript Slides

StreamX10: A Stream
Programming Framework on X10
Haitao Wei
2012-06-14
School of Computer Science, Huazhong University of Science and Technology
Outline
1. Introduction and Background
2. COStream Programming Language
3. Stream Compilation on X10
4. Experiments
5. Conclusion and Future Work
Background and Motivation
 Stream programming
 A high-level programming model that has been productively applied
 Usually depends on a specific architecture, which makes stream programs difficult to port between platforms
 X10
 A productive parallel programming environment
 Isolates the details of different architectures
 Provides a flexible parallel programming abstraction layer for stream programming
 StreamX10: an attempt to make stream programs portable by building on X10
COStream Language
 stream
 A FIFO queue connecting operators
 operator
 The basic functional unit: an actor node in the stream graph
 Multiple inputs and multiple outputs
 Window with pop/peek/push-like operations
 init and work functions
 composite
 Connected operators: a subgraph of actors
 A stream program is composed of composites
COStream and Stream Graph

composite Main {
  graph
    stream<int i> S = Source() {
      state: { int x; }
      init: { x = 0; }
      work: {
        S[0].i = x;
        x++;
      }
      window S: tumbling, count(1);
    }
    stream<int j> P = MyOp(S) {
      param
        pn: N;
    }
    () as SinkOp = Sink(P) {
      state: { int r; }
      work: {
        r = P[0].j;
        println(r);
      }
      window P: tumbling, count(1);
    }
}

composite MyOp(output Out; input In) {
  graph
    stream<int j> Out = Averager(In) {
      work: {
        int sum = 0, i;
        for (i = 0; i < pn; i++)
          sum += In[i].j;
        Out[0].j = (sum / pn);
      }
      window In: sliding, count(10), count(1);
      window Out: tumbling, count(1);
    }
}

[Stream graph: Source (push=1) --S--> Averager (peek=10, pop=1, push=1) --P--> Sink (pop=1)]
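The window declarations translate directly into FIFO rates: Source pushes 1 item per firing, the Averager's sliding window peeks 10 items and pops 1 per firing while pushing one average, and Sink pops 1. A minimal Python sketch of those pop/peek/push semantics (the class and names here are illustrative, not code the COStream compiler emits):

```python
from collections import deque

class AveragerSketch:
    """Sliding window: peek 10 items, pop 1, push 1 average per firing."""
    def __init__(self, peek=10, pop=1):
        self.fifo = deque()
        self.peek_n, self.pop_n = peek, pop

    def push(self, item):
        # Producer side: Source's push operation.
        self.fifo.append(item)

    def can_fire(self):
        # An operator may fire only when the window is full.
        return len(self.fifo) >= self.peek_n

    def fire(self):
        window = list(self.fifo)[:self.peek_n]   # peek 10
        for _ in range(self.pop_n):              # pop 1 (window slides by 1)
            self.fifo.popleft()
        return sum(window) // self.peek_n        # push the integer average

# Source's work function pushes 0, 1, 2, ...; the Sink pops each average.
avg = AveragerSketch()
outputs = []
for x in range(12):
    avg.push(x)
    while avg.can_fire():
        outputs.append(avg.fire())
```

Because the window slides by one, the first firing consumes items 0..9 and each later firing reuses nine of the previous ten items, which is exactly what distinguishes a sliding window from a tumbling one.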
Compilation flow of StreamX10
Phrase
Function
Translates the COStream syntax into abstract syntax tree.
Front-end
Instantiation
Instantiates the composites hierarchically to static flattened
operators.
Constructs static stream graph from flattened operators.
Static Stream Graph
Scheduling
Calculates initialization and steady-state execution orderings of
operators.
Partitioning
Performs partitioning based on X10 parallelism models for load
balance.
Generates X10 code for COStream programs.
Code Generation
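The Scheduling phase solves balance equations over the operators' push/pop rates so that every FIFO returns to its starting fill level once per steady-state iteration. A Python sketch of that standard synchronous-dataflow calculation for a simple pipeline chain (an illustration of the technique, not StreamX10's actual scheduler code):

```python
from math import gcd

def steady_repetitions(rates):
    """rates: list of (push, pop) pairs, one per edge of a pipeline chain;
    edge i connects actor i (producer, push rate) to actor i+1 (consumer,
    pop rate).  Returns the number of firings of each actor per
    steady-state iteration, satisfying reps[i]*push == reps[i+1]*pop."""
    n = len(rates) + 1
    reps = [1] * n
    for i, (push, pop) in enumerate(rates):
        need = reps[i] * push
        g = gcd(need, pop)
        scale = pop // g
        # Scale all earlier repetition counts so reps[i+1] stays integral.
        for j in range(i + 1):
            reps[j] *= scale
        reps[i + 1] = (reps[i] * push) // pop
    return reps
```

For the slide's example graph every edge has matching rates, so each operator fires once per iteration; the initialization schedule then only has to prefill the Averager's sliding window (peek 10, pop 1) with extra Source firings before the steady state can begin.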
The Execution Framework

[Figure: three places (Place 0, Place 1, Place 2), each running activities on a thread pool; local buffer objects carry intra-place data flow, and a global buffer object carries inter-place data flow.]

 The nodes are partitioned among the places
 Each node is mapped to an activity
 The nodes run in pipeline fashion to exploit parallelism
 Local and global FIFO buffers are used for intra-place and inter-place data flow, respectively
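This pipelined execution style can be sketched with ordinary threads standing in for X10 activities and bounded FIFOs standing in for the buffer objects (a Python illustration of the concept, not the generated runtime code):

```python
import threading, queue

def run_pipeline(source_items, transform):
    """Three 'nodes' (source, worker, sink), each in its own thread
    (standing in for an activity), connected by bounded FIFO queues so
    the stages overlap in pipeline fashion."""
    q1, q2 = queue.Queue(maxsize=4), queue.Queue(maxsize=4)
    DONE = object()          # end-of-stream marker
    results = []

    def source():
        for x in source_items:
            q1.put(x)        # blocks when the FIFO is full (backpressure)
        q1.put(DONE)

    def worker():
        while (x := q1.get()) is not DONE:
            q2.put(transform(x))
        q2.put(DONE)

    def sink():
        while (x := q2.get()) is not DONE:
            results.append(x)

    threads = [threading.Thread(target=f) for f in (source, worker, sink)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The bounded queues give the same backpressure a fixed-size FIFO buffer provides: a fast producer blocks until the downstream node drains its input.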
Work Partition Inter-place

[Figure: a stream graph with total computation work 30 is cut into three partitions of work 10 each; the cut crosses 2 communication edges.]

Speedup: 30/10 = 3
Communication: 2
Objective: minimize communication while keeping load balance (using METIS)
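The two quantities in the figure, speedup and communication cost, can be computed for any candidate assignment of nodes to places. A small Python sketch of that evaluation; the graph below is a hypothetical example constructed to reproduce the slide's numbers, not one of the actual benchmark graphs:

```python
def evaluate_partition(node_work, edges, assignment):
    """node_work: {node: computation work}; edges: {(u, v): comm cost};
    assignment: {node: place}.  Returns (speedup, cut_cost), the two
    quantities a METIS-style partitioner trades off."""
    total = sum(node_work.values())
    per_place = {}
    for n, p in assignment.items():
        per_place[p] = per_place.get(p, 0) + node_work[n]
    speedup = total / max(per_place.values())   # bounded by the busiest place
    cut = sum(c for (u, v), c in edges.items()
              if assignment[u] != assignment[v])  # edges crossing places
    return speedup, cut

# Hypothetical chain with total work 30, split into three places of 10 each;
# the cut crosses 2 unit-cost edges, matching the slide: speedup 3, comm 2.
work = {'A': 10, 'B': 5, 'C': 5, 'D': 2, 'E': 2, 'F': 5, 'G': 1}
edges = {('A', 'B'): 1, ('B', 'C'): 1, ('C', 'D'): 1,
         ('D', 'E'): 1, ('E', 'F'): 1, ('F', 'G'): 1}
place = {'A': 0, 'B': 1, 'C': 1, 'D': 2, 'E': 2, 'F': 2, 'G': 2}
speedup, comm = evaluate_partition(work, edges, place)
```

Speedup is limited by the most heavily loaded place, which is why the partitioner balances load first and minimizes the cut among balanced partitions.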
Global FIFO Implementation

[Figure: the producer at Place0 pushes into its own local array; the consumer at Place1 peeks/pops from its own local array; a DistArray between the places holds the shared data, filled and drained by bulk copies.]

 Each producer/consumer has its own local buffer
 The producer uses the push operation to store data into its local buffer
 The consumer uses the peek/pop operations to fetch data from its local buffer
 When the local buffer is full (producer) or empty (consumer), data is copied automatically between the local array and the DistArray
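The producer-side behavior, buffering pushes locally and copying in bulk only when the local array fills, can be sketched as follows (Python, illustrative; in StreamX10 the shared segment is an X10 DistArray and the copy is a bulk inter-place transfer):

```python
class GlobalFIFOSketch:
    """Producer batches pushes into a local array and copies the whole
    batch to the shared buffer only when the local array is full,
    mimicking the local-array / DistArray split."""
    def __init__(self, batch=4):
        self.local = []      # producer-side local array
        self.shared = []     # stands in for the DistArray segment
        self.batch = batch   # local array capacity
        self.copies = 0      # bulk copies performed (the expensive step)

    def push(self, item):
        self.local.append(item)
        if len(self.local) == self.batch:   # local buffer full
            self.flush()

    def flush(self):
        # One bulk copy amortizes the inter-place communication cost
        # over `batch` items instead of paying it per push.
        self.shared.extend(self.local)
        self.local.clear()
        self.copies += 1

fifo = GlobalFIFOSketch(batch=4)
for i in range(8):
    fifo.push(i)
```

The consumer side mirrors this: peek/pop hit its local array, and a bulk copy refills it from the shared buffer only when the local array runs empty.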
X10 Code in the Back-end

//main.x10 control code
//Spawn an activity for each node at its place, according to the partition
public static def main() {
  ...
  finish for (p in Place.places())
    async at (p) {
      switch (p.id) {
        case 0:
          val a_0 = new Source_0(rc);
          a_0.run();
          break;
        case 1:
          val a_2 = new MovingAver_2(rc);
          a_2.run();
          break;
        case 2:
          val a_1 = new Sink_1(rc);
          a_1.run();
          break;
        default: break;
      }
    }
  ...
}

//Source.x10 code
...
//Define the work function
def work() {
  ...
  push_Source_0_Sink_1(0).x = x;
  x += 1.0;
  pushTokens();
  popTokens();
}
//Call the work function in the initial and steady schedules
public def run() {
  initWork(); //init
  //initSchedule
  for (var j:Int = 0; j < Source_0_init; j++)
    work();
  //steadySchedule
  for (var i:Int = 0; i < RepeatCount; i++)
    for (var j:Int = 0; j < Source_0_steady; j++)
      work();
  flush();
}
...
Experimental Platform and Benchmarks
 Platform
 Intel Xeon processor (8 cores) at 2.4 GHz with 4 GB memory
 Red Hat EL5 with Linux 2.6.18
 X10 compiler and runtime version 2.2.0
 Benchmarks
 11 benchmarks rewritten from StreamIt
Throughput Comparison
 Throughput of 4 different configurations (NPLACES * NTHREADS = 8): NPLACES=1, NTHREADS=8; NPLACES=2, NTHREADS=4; NPLACES=4, NTHREADS=2; NPLACES=8, NTHREADS=1
 Normalized to the 1-place, 8-thread configuration

[Figure: bar chart of normalized throughput for the 11 benchmarks under the four configurations.]

• For most benchmarks, CPU utilization increases from 24% to 89% as the number of places varies from 1 to 4, except for benchmarks with a low computation/communication ratio
• Benefits are small or negative when the number of places increases from 4 to 8
Observation and Analysis
Throughput goes up as the number of places increases, because multiple places increase CPU utilization
Multiple places expose parallelism but also bring more communication overhead
Benchmarks with a heavier computation workload, like DES and Serpent_full, can still benefit from further increases in the number of places
Conclusion
 We proposed and implemented StreamX10, a stream programming language and compilation system on X10
 A raw partitioning optimization is proposed to exploit parallelism based on the X10 execution model
 Preliminary experiments were conducted to study the performance
Future Work
 How to automatically choose the best configuration (number of places and number of threads) for each benchmark
 How to decrease thread-switching overhead by mapping multiple nodes to a single activity
Acknowledgment
X10 Innovation Award funding support
Thanks to QiMing Teng, Haibo Lin, and David P. Grove at IBM for their help with this research